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Preface 


The  success  of  the  open-source  statistical  software  “7?”  has  made  a  significant 
impact  on  the  teaching  and  research  of  statistics  in  the  last  decade.  Analysing  data  is 
now  easier  and  more  affordable  than  ever,  but  choosing  the  most  appropriate  sta¬ 
tistical  methods  remains  a  challenge  for  many  users.  To  understand  and  interpret 
software  output,  it  is  necessary  to  engage  with  the  fundamentals  of  statistics. 

However,  many  readers  do  not  feel  comfortable  with  complicated  mathematics. 
In  this  book,  we  attempt  to  find  a  healthy  balance  between  explaining  statistical 
concepts  comprehensively  and  showing  their  application  and  interpretation  using  R. 

This  book  will  benefit  beginners  and  self-learners  from  various  backgrounds  as 
we  complement  each  chapter  with  various  exercises  and  detailed  and  comprehen¬ 
sible  solutions.  The  results  involving  mathematics  and  rigorous  proofs  are  separated 
from  the  main  text,  where  possible,  and  are  kept  in  an  appendix  for  interested 
readers.  Our  textbook  covers  material  that  is  generally  taught  in  introductory-level 
statistics  courses  to  students  from  various  backgrounds,  including  sociology, 
biology,  economics,  psychology,  medicine,  and  others.  Most  often,  we  introduce 
the  statistical  concepts  using  examples  and  illustrate  the  calculations  both  manually 
and  using  R. 

However,  while  we  provide  a  gentle  introduction  to  R  (in  the  appendix),  this  is 
not  a  software  book.  Our  emphasis  lies  on  explaining  statistical  concepts  correctly 
and  comprehensively,  using  exercises  and  software  to  delve  deeper  into  the  subject 
matter  and  learn  about  the  conceptual  challenges  that  the  methods  present. 

This  book’s  homepage,  http://chris.userweb.mwn.de/book/,  contains  additional 
material,  most  notably  the  software  codes  needed  to  answer  the  software  exercises, 
and  data  sets.  In  the  remainder  of  this  book,  we  will  use  grey  boxes 

7?-coirunand  ( ) 

to  introduce  the  relevant  R  commands.  In  many  cases,  the  code  can  be  directly 
pasted  into  R  to  reproduce  the  results  and  graphs  presented  in  the  book;  in  others, 
the  code  is  abbreviated  to  improve  readability  and  clarity,  and  the  detailed  code  can 
be  found  online. 


v 
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Part  I 

Descriptive  Statistics 


Introduction  and  Framework 


Statistics  is  a  collection  of  methods  which  help  us  to  describe,  summarize,  interpret, 
and  analyse  data.  Drawing  conclusions  from  data  is  vital  in  research,  administra¬ 
tion,  and  business.  Researchers  are  interested  in  understanding  whether  a  medical 
intervention  helps  in  reducing  the  burden  of  a  disease,  how  personality  relates  to 
decision-making,  whether  a  new  fertilizer  increases  the  yield  of  crops,  how  a  polit¬ 
ical  system  affects  trade  policy,  who  is  going  to  vote  for  a  political  party  in  the 
next  election,  what  are  the  long-term  changes  in  the  population  of  a  fish  species, 
and  many  more  questions.  Governments  and  organizations  may  be  interested  in  the 
life  expectancy  of  a  population,  the  risk  factors  for  infant  mortality,  geographical 
differences  in  energy  usage,  migration  patterns,  or  reasons  for  unemployment.  In 
business,  identifying  people  who  may  be  interested  in  a  certain  product,  optimizing 
prices,  and  evaluating  the  satisfaction  of  customers  are  possible  areas  of  interest. 

No  matter  what  the  question  of  interest  is,  it  is  important  to  collect  data  in  a 
way  which  allows  its  analysis.  The  representation  of  collected  data  in  a  data  set  or 
data  matrix  allows  the  application  of  a  variety  of  statistical  methods.  In  the  first 
part  of  the  book,  we  are  going  to  introduce  methods  which  help  us  in  describing 
data,  and  the  second  and  third  parts  of  the  book  focus  on  inferential  statistics,  which 
means  drawing  conclusions  from  data.  In  this  chapter,  we  are  going  to  introduce  the 
framework  of  statistics  which  is  needed  to  properly  collect,  administer,  evaluate,  and 
analyse  data. 


1 .1  Population,  Sample,  and  Observations 

Let  us  first  introduce  some  terminology  and  related  notations  used  in  this  book. 
The  units  on  which  we  measure  data — such  as  persons,  cars,  animals,  or  plants — 
are  called  observations.  These  units/observations  are  represented  by  the  Greek 
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1  Introduction  and  Framework 


symbol  uj.  The  collection  of  all  units  is  called  population  and  is  represented  by  Q. 
When  we  refer  t ouj  e  £2 ,  we  mean  a  single  unit  out  of  all  units,  e.g.  one  person  out  of 
all  persons  of  interest.  If  we  consider  a  selection  of  observations  uj\ ,  UJ2, . . . ,  uJn ,  then 
these  observations  are  called  sample.  A  sample  is  always  a  subset  of  the  population, 

\pJ\ ,  Ld2 ,  .  .  .  ,  UJn }  c  £2 . 

Example  1.1.1 

•  If  we  are  interested  in  the  social  conditions  under  which  Indian  people  live,  then 
we  would  define  all  inhabitants  of  India  as  Q  and  each  of  its  inhabitants  as  uj.  If  we 
want  to  collect  data  from  a  few  inhabitants,  then  those  would  represent  a  sample 
from  the  total  population. 

•  Investigating  the  economic  power  of  Africa’s  platinum  industry  would  require  to 
treat  each  platinum-related  company  as  uj,  whereas  all  platinum-related  companies 
would  be  collected  in  Q.  A  few  companies  uj\,  UJ2, . . . ,  uju  comprise  a  sample  of 
all  companies. 

•  We  may  be  interested  in  collecting  information  about  those  participating  in  a 
statistics  course.  All  participants  in  the  course  constitute  the  population  Q,  and 
each  participant  refers  to  a  unit  or  observation  uj. 

Remark  1.1.1  Sometimes,  the  concept  of  a  population  is  not  applicable  or  difficult 
to  imagine.  As  an  example,  imagine  that  we  measure  the  temperature  in  New  Delhi 
every  hour.  A  sample  would  then  be  the  time  series  of  temperatures  in  a  specific 
time  window,  for  example  from  January  to  March  2016.  A  population  in  the  sense  of 
observational  units  does  not  exist  here.  But  now  assume  that  we  measure  temperatures 
in  several  different  cities;  then,  all  the  cities  form  the  population,  and  a  sample  is  any 
subset  of  the  cities. 


1.2  Variables 

If  we  have  specified  the  population  of  interest  for  a  specific  research  question,  we 
can  think  of  what  is  of  interest  about  our  observations.  A  particular  feature  of  these 
observations  can  be  collected  in  a  statistical  variable  X.  Any  information  we  are 
interested  in  may  be  captured  in  such  a  variable.  For  example,  if  our  observations 
refer  to  human  beings,  X  may  describe  marital  status,  gender,  age,  or  anything  else 
which  may  relate  to  a  person.  Of  course,  we  can  be  interested  in  many  different 
features,  each  of  them  collected  in  a  different  variable  Xi ,  i  =  1,2,...,/?.  Each 
observation  uj  takes  a  particular  value  for  X.  If  X  refers  to  gender,  each  observation, 
i.e.  each  person,  has  a  particular  value  v  which  refers  to  either  “male”  or  “female”. 

The  formal  definition  of  a  variable  is 


uj  i->  x 


(1.1) 
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This  definition  states  that  a  variable  X  takes  a  value  v  for  each  observation  oo  e  £2, 

whereby  the  number  of  possible  values  is  contained  in  the  set  S. 

Example  1.2.1 

•  If  X  refers  to  gender,  possible  v-values  are  contained  in  S  =  {male,  female}.  Each 
observation  uo  is  either  male  or  female,  and  this  information  is  summarized  in  X. 

•  Let  X  be  the  country  of  origin  for  a  car.  Possible  values  to  be  taken  by  an  observation 
l o  (i.e.  a  car)  are  S  =  {Italy,  South  Korea,  Germany,  France,  India,  China,  Japan, 
USA,  . . .}. 

•  A  variable  X  which  refers  to  age  may  take  any  value  between  1  and  125.  Each 
person  cj  is  assigned  a  value  v  which  represents  the  age  of  this  person. 


1 .2.1  Qualitative  and  Quantitative  Variables 

Qualitative  variables  are  the  variables  which  take  values  v  that  cannot  be  ordered  in 
a  logical  or  natural  way.  For  example, 

•  the  colour  of  the  eye, 

•  the  name  of  a  political  party,  and 

•  the  type  of  transport  used  to  travel  to  work 

are  all  qualitative  variables.  Neither  is  there  any  reason  to  list  blue  eyes  before  brown 
eyes  (or  vice  versa)  nor  does  it  make  sense  to  list  buses  before  trains  (or  vice  versa). 

Quantitative  variables  represent  measurable  quantities.  The  values  which  these 
variables  can  take  can  be  ordered  in  a  logical  and  natural  way.  Examples  of  quanti¬ 
tative  variables  are 

•  size  of  shoes, 

•  price  for  houses, 

•  number  of  semesters  studied,  and 

•  weight  of  a  person. 


Remark  1.2.1  It  is  common  to  assign  numbers  to  qualitative  variables  for  practical 
purposes  in  data  analyses  (see  Sect.  1.4  for  more  detail).  For  instance,  if  we  consider 
the  variable  “gender”,  then  each  observation  can  take  either  the  “value”  male  or 
female.  We  may  decide  to  assign  1  to  female  and  0  to  male  and  use  these  numbers 
instead  of  the  original  categories.  However,  this  is  arbitrary,  and  we  could  have  also 
chosen  “1”  for  male  and  “0”  for  female,  or  “2”  for  male  and  “10”  for  female.  There 
is  no  logical  and  natural  order  on  how  to  arrange  male  and  female,  and  thus,  the 
variable  gender  remains  a  qualitative  variable,  even  after  using  numbers  for  coding 
the  values  that  X  can  take. 
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1 .2.2  Discrete  and  Continuous  Variables 

Discrete  variables  are  variables  which  can  only  take  a  finite  number  of  values. 
All  qualitative  variables  are  discrete,  such  as  the  colour  of  the  eye  or  the  region  of 
a  country.  But  also  quantitative  variables  can  be  discrete:  the  size  of  shoes  or  the 
number  of  semesters  studied  would  be  discrete  because  the  number  of  values  these 
variables  can  take  is  limited. 

Variables  which  can  take  an  infinite  number  of  values  are  called  continuous 
variables.  Examples  are  the  time  it  takes  to  travel  to  university,  the  length  of  an 
antelope,  and  the  distance  between  two  planets.  Sometimes,  it  is  said  that  continuous 
variables  are  variables  which  are  “measured  rather  than  counted”.  This  is  a  rather 
informal  definition  which  helps  to  understand  the  difference  between  discrete  and 
continuous  variables.  The  crucial  point  is  that  continuous  variables  can,  in  theory, 
take  an  infinite  number  of  values;  for  instance,  the  height  of  a  person  may  be  recorded 
as  172  cm.  However,  the  actual  height  on  the  measuring  tape  might  be  172.3  cm  which 
was  rounded  off  to  172  cm.  If  one  had  a  better  measuring  instrument,  we  may  have 
obtained  172.342  cm.  But  the  real  height  of  this  person  is  a  number  with  indefinitely 
many  decimal  places  such  as  172.342975328... cm.  No  matter  what  we  eventually 
report  or  obtain,  a  variable  which  can  take  an  infinite  amount  of  values  is  defined  to 
be  a  continuous  variable. 


1.2.3  Scales 

The  thoughts  and  considerations  from  above  indicate  that  different  variables  contain 
different  amounts  of  information.  A  useful  classification  of  these  considerations  is 
given  by  the  concept  of  the  scale  of  a  variable.  This  concept  will  help  us  in  the 
remainder  of  this  book  to  identify  which  methods  are  the  appropriate  ones  to  use  in 
a  particular  setting. 

Nominal  scale.  The  values  of  a  nominal  variable  cannot  be  ordered.  Examples  are 
the  gender  of  a  person  (male-female)  or  the  status  of  an  application  (pending-not 
pending). 

Ordinal  scale.  The  values  of  an  ordinal  variable  can  be  ordered.  However,  the  differ¬ 
ences  between  these  values  cannot  be  interpreted  in  a  meaningful  way.  For  exam¬ 
ple,  the  possible  values  of  education  level  (none-primary  education-secondary 
education-university  degree)  can  be  ordered  meaningfully,  but  the  differences 
between  these  values  cannot  be  interpreted.  Likewise,  the  satisfaction  with  a  prod¬ 
uct  (unsatisfied-satisfied-very  satisfied)  is  an  ordinal  variable  because  the  values 
this  variable  can  take  can  be  ordered,  but  the  differences  between  “unsatisfied- 
satisfied”  and  “satisfied-very  satisfied”  cannot  be  compared  in  a  numerical  way. 
Continuous  scale.  The  values  of  a  continuous  variable  can  be  ordered.  Furthermore, 
the  differences  between  these  values  can  be  interpreted  in  a  meaningful  way.  For 
instance,  the  height  of  a  person  refers  to  a  continuous  variable  because  the  values 
can  be  ordered  (170  cm,  171  cm,  172  cm,  ...),  and  differences  between  these 
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values  can  be  compared  (the  difference  between  170  and  171cm  is  the  same 
as  the  difference  between  171  and  172  cm).  Sometimes,  the  continuous  scale  is 
divided  further  into  subscales.  While  in  the  remainder  of  the  book  we  typically 
do  not  need  these  classifications,  it  is  still  useful  to  reflect  on  them: 

Interval  scale.  Only  differences  between  values,  but  not  ratios,  can  be  interpreted. 
An  example  for  this  scale  would  be  temperature  (measured  in  °C):  the  difference 
between  —  2  °C  and  4  °C  is  6  °C,  but  the  ratio  of  4/  —  2  =  —2  does  not  mean  that 
—4  °C  is  twice  as  cold  as  2  °C. 

Ratio  scale.  Both  differences  and  ratios  can  be  interpreted.  An  example  is  speed: 
60  km/h  is  40  km/h  more  than  20  km/h.  Moreover,  60  km/h  is  three  times  faster 
than  20  km/h  because  the  ratio  between  them  is  3. 

Absolute  scale.  The  absolute  scale  is  the  same  as  the  ratio  scale,  with  the  excep¬ 
tion  that  the  values  are  measured  in  “natural”  units.  An  example  is  “number  of 
semesters  studied”  where  no  artificial  unit  such  as  km/h  or  °C  is  needed:  the 
values  are  simply  1,  2,  3,  . . .. 


1 .2.4  Grouped  Data 

Sometimes,  data  may  be  available  only  in  a  summarized  form:  instead  of  the  original 
value,  one  may  only  know  the  category  or  group  the  value  belongs  to.  For  example, 

•  it  is  often  convenient  in  a  survey  to  ask  for  the  income  (per  year)  by  means  of 
groups:  [€0-€20,000),  [€20,000-€30,000),  . . .,  >  =€100,000; 

•  if  there  are  many  political  parties  in  an  election,  those  with  a  low  number  of  voters 
are  often  summarized  in  a  new  category  “Other  Parties”; 

•  instead  of  capturing  the  number  of  claims  made  by  an  insurance  company  customer, 
the  variable  “claimed”  may  denote  whether  or  not  the  customer  claimed  at  all 
(yes-no). 

If  data  is  available  in  grouped  form,  we  call  the  respective  variable  capturing 
this  information  a  grouped  variable.  Sometimes,  these  variables  are  also  known  as 
categorical  variables.  This  is,  however,  not  a  complete  definition  because  categorical 
variables  refer  to  any  type  of  variable  which  takes  a  finite,  possibly  small,  number  of 
values.  Thus,  any  discrete  and/or  nominal  and/or  ordinal  and/or  qualitative  variable 
may  be  regarded  as  a  categorical  variable.  Any  grouped  or  categorical  variable  which 
can  only  take  two  values  is  called  a  binary  variable. 

To  gain  a  better  understanding  on  how  the  definitions  from  the  above  sections 
relate  to  each  other  see  Fig.  1.1.  Qualitative  data  is  always  discrete,  but  quantitative 
data  can  be  both  discrete  (e.g.  size  of  shoes  or  a  grouped  variable)  and  continuous 
(e.g.  temperature).  Nominal  variables  are  always  qualitative  and  discrete  (e.g.  colour 
of  the  eye),  whereas  continuous  variables  are  always  quantitative  (e.g.  temperature). 
Categorical  variables  can  be  both  qualitative  (e.g.  colour  of  the  eye)  and  quantitative 
(satisfaction  level  on  a  scale  from  1  to  5).  Categorical  variables  are  never  continuous. 
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interval 

nominal  ordinal  continuous 


Fig.  1 .1  Summary  of  variable  classifications 


1.3  Data  Collection 

When  collecting  data,  we  may  ask  ourselves  how  to  facilitate  this  in  detail  and 
how  much  data  needs  to  be  collected.  The  latter  question  will  be  partly  answered 
in  Sect.  9.5;  but  in  general,  we  can  think  of  collecting  data  either  on  all  subjects  of 
interest,  such  as  in  a  national  census,  or  on  a  representative  sample  of  the  population. 
Most  commonly,  we  gather  data  on  a  sample  (described  in  the  Part  I  of  this  book)  and 
then  draw  conclusions  about  the  population  of  interest  (discussed  in  the  Part  III  of 
this  book).  A  sample  might  either  be  chosen  by  us  or  obtained  through  third  parties 
(hospitals,  government  agencies),  or  created  during  an  experiment.  This  depends  on 
the  context  as  described  below. 

Survey.  A  survey  typically  (but  not  always)  collects  data  by  asking  questions  (in 
person  or  by  phone)  or  providing  questionnaires  to  study  participants  (as  a  printout 
or  online).  For  example,  an  opinion  poll  before  a  national  election  provides  evidence 
about  the  future  government:  potential  voters  are  asked  by  phone  which  party  they  are 
going  to  vote  for  in  the  next  election;  on  the  day  of  the  election,  this  information  can 
be  updated  by  asking  the  same  question  to  a  sample  of  voters  who  have  just  delivered 
their  vote  at  the  polling  station  (so-called  exit  poll).  A  behavioural  research  survey 
may  ask  members  of  a  community  about  their  knowledge  and  attitudes  towards  drug 
use.  For  this  purpose,  the  study  coordinators  can  send  people  with  a  questionnaire 
to  this  community  and  interview  members  of  randomly  selected  households. 

Ideally,  a  survey  is  conducted  in  a  way  which  makes  the  chosen  sample  repre¬ 
sentative  of  the  population  of  interest.  If  a  marketing  company  interviews  people  in 
a  pedestrian  zone  to  find  their  views  about  a  new  chocolate  bar,  then  these  people 
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may  not  be  representative  of  those  who  will  potentially  be  interested  in  this  product. 
Similarly,  if  students  are  asked  to  fill  in  an  online  survey  to  evaluate  a  lecture,  it 
may  turn  out  that  those  who  participate  are  on  average  less  satisfied  than  those  who 
do  not.  Survey  sampling  is  a  complex  topic  on  its  own.  The  interested  reader  may 
consult  Groves  et  al.  (2009)  or  Kauermann  and  Kiichenhoff  (2011). 

Experiment.  Experimental  data  is  obtained  in  “controlled”  settings.  This  can  mean 
many  things,  but  essentially  it  is  data  which  is  generated  by  the  researcher  with  full 
control  over  one  or  many  variables  of  interest.  For  instance,  suppose  there  are  two 
competing  toothpastes,  both  of  which  promise  to  reduce  pain  for  people  with  sensitive 
teeth.  If  the  researcher  decided  to  randomly  assign  toothpaste  A  to  half  of  the  study 
participants,  and  toothpaste  B  to  the  other  half,  then  this  is  an  experiment  because 
it  is  only  the  researcher  who  decides  which  toothpaste  is  to  be  used  by  any  of  the 
participants.  It  is  not  decided  by  the  participant.  The  data  of  the  variable  toothpaste 
is  controlled  by  the  experimenter.  Consider  another  example  where  the  production 
process  of  a  product  can  potentially  be  reduced  by  combining  two  processes.  The 
management  could  decide  to  implement  the  new  process  in  three  production  facilities, 
but  leave  it  as  it  is  in  the  other  facilities.  The  production  process  for  the  different 
units  (facilities)  is  therefore  under  control  of  the  management.  However,  if  each 
facility  could  decide  for  themselves  if  they  wanted  a  change  or  not,  it  would  not  be 
an  experiment  because  factors  not  directly  controlled  by  the  management,  such  as  the 
leadership  style  of  the  facility  manager,  would  determine  which  process  is  chosen. 

Observational  Data.  Observational  data  is  data  which  is  collected  routinely,  without 
a  researcher  designing  a  survey  or  conducting  an  experiment.  Suppose  a  blood  sample 
is  drawn  from  each  patient  with  a  particular  acute  infection  when  they  arrive  at  a 
hospital.  This  data  may  be  stored  in  the  hospital’s  folders  and  later  accessed  by  a 
researcher  who  is  interested  in  studying  this  infection.  Or  suppose  a  government 
institution  monitors  where  people  live  and  move  to.  This  data  can  later  be  used  to 
explore  migration  patterns. 

Primary  and  Secondary  Data.  Primary  data  is  data  we  collect  ourselves,  i.e.  via  a 
survey  or  experiment.  Secondary  data,  in  contrast,  is  collected  by  someone  else.  For 
example,  data  from  a  national  census,  publicly  available  databases,  previous  research 
studies,  government  reports,  historical  data,  and  data  from  the  internet,  among  others, 
are  secondary  data. 


1 .4  Creating  a  Data  Set 

There  is  a  unique  way  in  which  data  is  prepared  and  collected  to  utilize  statistical 
analyses.  The  data  is  stored  in  a  data  matrix  (=data  set)  with  p  columns  and  n  rows 
(see  Fig.  1.2).  Each  row  corresponds  to  an  observation/unit  uj  and  each  column  to 
a  variable  X.  This  means  that,  for  example,  the  entry  in  the  fourth  row  and  second 
column  (*42 )  describes  the  value  of  the  fourth  observation  on  the  second  variable. 
The  examples  below  will  illustrate  the  concept  of  a  data  set  in  more  detail. 
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UJ 

Variable  1 

Variable  2 

Variable  p 

(1 

Xu 

X12 

X\ p  ^ 

2 

X21 

X22 

X2  p 

y  n 

Xnl 

Xn2 

Xnp  J 

Fig.  1.2  Data  set  or  data  matrix 
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77 

82 
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73 

93 

68 

Student  D 

88 

71 

63 

58 

Student  E 

75 

83 

63 

57 

\ 
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Fig. 1.3  Data  set  of  marks  of  five  students 


Example  1.4. 1  Suppose  five  students  take  examinations  in  music,  mathematics,  biol¬ 
ogy,  and  geography.  Their  marks,  measured  on  a  scale  between  0  and  100  (where 
100  is  the  best  mark),  can  be  written  down  as  illustrated  in  Fig.  1.3.  Note  that  each 
row  refers  to  a  student  and  each  column  to  a  variable.  We  consider  a  larger  data  set 
in  the  next  example. 


Example  1.4.2  Consider  the  data  set  described  in  Appendix  A.4.  A  pizza  delivery 
service  captures  information  related  to  each  delivery,  for  example  the  delivery  time, 
the  temperature  of  the  pizza,  the  name  of  the  driver,  the  date  of  the  delivery,  the 
name  of  the  branch,  and  many  more.  To  capture  the  data  of  all  deliveries  during  one 
month,  we  create  a  data  matrix.  Each  row  refers  to  a  particular  delivery,  therefore 
representing  the  observations  of  the  data.  Each  column  refers  to  a  variable.  In  Fig.  1 .4, 
the  variables  X\  (delivery  time  in  minutes),  X 2  (temperature  in  °C),  and  X\2  (name 
of  branch)  are  listed. 


Delivery 

Delivery  Time 

Temperature 

Branch 

1 

35.1 

68.3 

East  (1) 

2 

25.2 

71.0 

East  (1) 

1266 

35.7 

60.8 

West  (2) 

Fig.  1.4  Pizza  data  set 
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Table  1 .1  Coding  list  for 
branch 


Variable 

Values 

Code 

Branch 

East 

1 

West 

2 

Centre 

3 

Missing 

4 

The  first  row  tells  us  about  the  features  of  the  first  pizza  delivery:  the  delivery 
time  was  35.1  min,  the  pizza  arrived  with  a  temperature  of  68.3  °C,  and  the  pizza 
was  delivered  from  the  branch  in  the  East  of  the  city.  In  total,  there  were  n  =  1266 
deliveries.  For  nominal  variables,  such  as  branch,  we  may  decide  to  produce  a  coding 
list,  as  illustrated  in  Table  1.1:  instead  of  referring  to  the  branches  as  “East”,  “West”, 
and  “Centre”,  we  may  simply  call  them  1,  2,  and  3.  As  we  will  see  in  Chap.  11,  this 
has  benefits  for  some  analysis  methods,  though  this  is  not  needed  in  general. 

If  some  values  are  missing,  for  example  because  they  were  never  captured  or  even 
lost,  then  this  requires  special  attention.  In  Table  1.1,  we  assign  missing  values  the 
number  “4”  and  therefore  treat  them  as  a  separate  category.  If  we  work  with  statistical 
software  (see  below),  we  may  need  other  coding  such  as  NA  in  the  statistical  software 
R  or  in  Stata.  More  detail  can  be  found  in  Appendix  A. 

Another  consideration  when  collecting  data  is  that  of  transformations:  we  may 
have  captured  the  velocity  of  cars  in  kilometres  per  hour,  but  may  need  to  present 
the  data  in  miles  per  hour;  we  have  captured  the  temperature  in  degrees  Celsius, 
whereas  we  need  to  communicate  results  in  degrees  Fahrenheit,  or  we  have  created  a 
satisfaction  score  which  we  want  to  range  from  —5  to  +5,  while  the  score  currently 
runs  from  0  to  20.  This  is  not  a  problem  at  all.  We  can  simply  create  a  new  variable 
which  reflects  the  required  transformation.  However,  valid  transformations  depend 
on  the  scale  of  a  variable.  Variables  on  an  interval  scale  can  use  transformations  of 
the  following  kind: 

g(x)=a  +  bx,  b  >  0.  (E2) 

For  ratio  scales,  only  the  following  transformations  are  valid: 

g(x )  =  bx ,  b  >  0.  (1-3) 

In  the  above  equation,  a  is  set  to  0  because  ratios  only  stay  the  same  if  we  respect  a 
variable’s  natural  point  of  origin. 

Example  1.4.3  The  temperature  in  °F  relates  to  the  temperature  in  °C  as  follows: 

Temperature  in°F  =  32  +  1.8  Temperature  in  °C 

g(v)  =  a  +  b  x 

This  means  that  25  °C  relates  to  (32  +  1.8  •  25)  °F  =  77  °F.  If  X\  is  a  variable 
representing  temperature  by  °C,  we  can  simply  create  a  new  variable  X 2  which  is 
temperature  in  °F.  Since  temperature  is  measured  on  an  interval  scale,  this  transfor¬ 
mation  is  valid. 
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Changing  currencies  is  also  possible.  If  we  would  like  to  represent  the  price  of  a 
product  not  in  South  African  Rand  but  in  €,  we  simply  apply  the  transformation 

Price  in  South  African  Rand  =  b  •  Price  in  € 

whereby  b  is  the  currency  exchange  rate. 


1 .4.1  Statistical  Software 

There  are  number  of  statistical  software  packages  which  allow  data  collection,  man¬ 
agement,  and-most  importantly-analysis.  In  this  book,  we  focus  on  the  statistical 
software  R  which  is  freely  available  at  http://cran.r-project.org/.  A  gentle  introduc¬ 
tion  to  R  is  provided  in  Appendix  A.  A  data  matrix  can  be  created  manually  using 
commands  such  as  matrix () ,  data,  frame  () ,  and  others.  Any  data  can  be  edited 
using  edit  () .  However,  typically  analysts  have  already  typed  their  data  into  data¬ 
bases  or  spreadsheets,  for  example  in  Excel,  Access,  or  MySQL.  In  most  of  these 
applications,  it  is  possible  to  save  the  data  as  an  ASCII  file  (. dat ),  as  a  tab-delimited 
file  (.txt),  or  as  a  comma- separated  values  file  (.csv).  All  of  these  formats  allow  easy 
switching  between  different  software  and  database  applications.  Such  data  can  easily 
be  read  into  R  by  means  of  the  following  commands: 

setwd( ' C : /directory ' ) 
read . table ( ' pizza_delivery . dat ' ) 
read . table ( ' pizza_delivery . txt ' ) 
read . csv ( ' pizza_delivery . csv ' ) 

where  setwd  specifies  the  working  directory.  Alternatively,  loading  the  library 
foreign  allows  the  import  of  data  from  many  different  statistical  software  pack¬ 
ages,  notably  Stata,  SAS,  Minitab,  SPSS,  among  others.  A  detailed  description  of 
data  import  and  export  can  be  found  in  the  respective  R  manual  available  at  http:// 
cran.r-project.org/doc/manuals/r-release/R-data.pdf.  Once  the  data  is  read  into  R , 
it  can  be  viewed  with 


fix() 

#  option  1 

R 

ViewO 

#  option  2 

We  can  also  can  get  an  overview  of  the  data  directly  in  the  R  -console  by  displaying 
only  the  top  lines  of  the  data  with  head  ( ) .  Both  approaches  are  visualized  in  Fig.  1 .5 
for  the  pizza  data  introduced  in  Example  1.4.2. 


1.5  Key  Points  and  Further  Issues 
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>  pizza  <-  read, csv (nprzza_delivery . csv") 

>  head  [pizza) 


day 

date 

time 

operator 

branch 

driver 

temperature 

bill 

1 

Thursday 

01' -May- 14 

35  *  12337 

Laura 

East 

Bruno 

68 .23772 

58.3 

2 

Thursday 

01 -May-13 

25.20307 

Melissa 

East 

Salvatore 

70.99779 

26,4 

3 

Thursday 

01 -May- 13 

35.63340 

Melissa 

West 

Salvatore 

53.39415 

58.1 

4 

Thursday 

01 -May- 13 

29.37330 

Melissa 

East 

Salvatore 

70.30660 

35.2 

5 

Thursday 

01 -May-13 

29.99361 

Melissa 

West 

Salvatore 

71.50169 

38,3 

6 

Thursday 

01 -May- 13 

40.25332 

Melissa 

Centre 

Bruno 

60.75950 

61.8 

>  fix (pizza) 


File  Windows  Edit  Help 


day 

data 

time 

operator 

branch 

driver 

temperature 

bill 

1 

Thursday 

Ol-May-14 

35.12837 

Laura 

East 

Bruno 

66.28772 

58.4 

2 

Thursday 

01-May-14 

25.20307 

Melissa 

East 

Salvatore 

70. 99779 

26.4 

3 

Thursday 

Ol-May-14 

45. 6434 

Melissa 

West 

Salvatore 

53.39415 

sa.i 

a 

Thursday 

0 1 “May- 1 3 

29.3743 

Melissa 

East 

Salvatore 

70.3066 

35.2 

5 

Thursday 

01 -May- 14 

29.99461 

Melissa 

West 

Salvatore 

71.50169 

35.4 

6 

Thursday 

01-May- 14 

40.25432 

Melissa 

Centre 

Bruno 

60.7595 

61.8 

7 

Thursday 

0 1 -May- 1 4 

48.72861 

Laura 

West 

Bruno 

53.2587 

57 . 9 

8 

Thursday 

Ol-May-14 

34.02772 

Melissa 

West 

Mario 

68.12793 

35.8 

9 

Thursday 

01 -May- 14 

28. 20943 

Laura 

Centre 

Mario 

64 . 94661 

36. 6 

10 

Thursday 

01 “May- 14 

37.95479 

Melissa 

Centre 

Bruno 

60.0073S 

44 .8 

11 

Thursday 

Ol-May-14 

42.07956 

Melissa 

Centre 

Bruno 

65.53872 

49.5 

Fig.  1 .5  Viewing  data  in  R 


1 .5  Key  Points  and  Further  Issues 


Note: 


/  The  scale  of  variables  is  not  only  a  formalism  but  an  essential  framework 
for  choosing  the  correct  analysis  methods.  This  is  particularly  relevant 
for  association  analysis  (Chap.  4),  statistical  tests  (Chap.  10),  and  linear 
regression  (Chap.  11). 

/  Even  if  variables  are  measured  on  a  nominal  scale  (i.e.  if  they  are  cate¬ 
gorical/qualitative),  we  may  choose  to  assign  a  number  to  each  category 
of  this  variable.  This  eases  the  implementation  of  some  analysis  methods 
introduced  later  in  this  book. 

/  Data  is  usually  stored  in  a  data  matrix  where  the  rows  represent  the 
observations  and  the  columns  are  variables.  It  can  be  analysed  with 
statistical  software.  We  use  R  (R  Core  Team  2016)  in  this  book.  A 
gentle  introduction  is  provided  in  Appendix  A  and  throughout  the  book. 
A  more  comprehensive  introduction  can  be  found  in  other  books,  for 
example  in  Albert  and  Rizzo  (2012),  Crawley  (2013),  or  Ligges  (2008). 
Even  advanced  books,  e.g.  Adler  (2012)  or  Everitt  and  Hothorn  (2011), 
can  offer  insights  to  beginners. 
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1  Introduction  and  Framework 


1 .6  Exercises 

Exercise  1.1  Describe  both  the  population  and  the  observations  for  the  following 
research  questions: 

(a)  Evaluation  of  the  satisfaction  of  employees  from  an  airline. 

(b)  Description  of  the  marks  of  students  from  an  assignment. 

(c)  Comparison  of  two  drugs  which  deal  with  high  blood  pressure. 

Exercise  1.2  A  national  park  conducts  a  study  on  the  behaviour  of  their  leopards. 
A  few  of  the  park’s  leopards  are  registered  and  receive  a  GPS  device  which  allows 
measuring  the  position  of  the  leopard.  Use  this  example  to  describe  the  following 
concepts:  population,  sample,  observation,  value,  and  variable. 

Exercise  1.3  Which  of  the  following  variables  are  qualitative,  and  which  are  quan¬ 
titative?  Specify  which  of  the  quantitative  variables  are  discrete  and  which  are 
continuous: 

Time  to  travel  to  work,  shoe  size,  preferred  political  party,  price  for  a  canteen  meal,  eye 
colour,  gender,  wavelength  of  light,  customer  satisfaction  on  a  scale  from  1  to  10,  delivery 
time  for  a  parcel,  blood  type,  number  of  goals  in  a  hockey  match,  height  of  a  child,  subject 
line  of  an  email. 

Exercise  1.4  Identify  the  scale  of  the  following  variables: 

(a)  Political  party  voted  for  in  an  election 

(b)  The  difficulty  of  different  levels  in  a  computer  game 

(c)  Production  time  of  a  car 

(d)  Age  of  turtles 

(e)  Calender  year 

(f)  Price  of  a  chocolate  bar 

(g)  Identification  number  of  a  student 

(h)  Final  ranking  at  a  beauty  contest 

(i)  Intelligence  quotient. 

Exercise  1.5  Make  yourself  familiar  with  the  pizza  data  set  from  Appendix  A.4. 

(a)  First,  browse  through  the  introduction  to  R  in  Appendix  A.  Then,  read  in  the 
data. 

(b)  View  the  data  both  in  the  R  data  editor  and  in  the  R  console. 

(c)  Create  a  new  data  matrix  which  consists  of  the  first  5  rows  and  first  5  variables 
of  the  data.  Print  this  data  set  on  the  R  console.  Now,  save  this  data  set  in  your 
preferred  format. 

(d)  Add  a  new  variable  “NewTemperature”  to  the  data  set  which  converts  the  tem¬ 
perature  from  °C  to  °F. 


1 .6  Exercises 
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(e)  Attach  the  data  and  list  the  values  from  the  variable  “NewTemperature”. 

(f)  Use  “?”  to  make  yourself  familiar  with  the  following  commands:  str,  dim, 
colnames,  names,  nrow,  ncol,  head,  and  tail.  Apply  these  commands 
to  the  data  to  get  more  information  about  it. 

Exercise  1 . 6  Consider  the  research  questions  of  describing  parents  ’  attitudes  towards 

immunization,  what  proportion  of  them  wants  immunization  against  chicken  pox  for 

their  last-born  child,  and  whether  this  proportion  differs  by  gender  and  age. 

(a)  Which  data  collection  method  is  the  most  suitable  one  to  answer  the  above 
questions:  survey  or  experiment? 

(b)  How  would  you  capture  the  attitudes  towards  immunization  in  a  single  variable? 

(c)  Which  variables  are  needed  to  answer  all  the  above  questions?  Describe  the  scale 
of  each  of  them. 

(d)  Reflect  on  what  an  appropriate  data  set  would  look  like.  Now,  given  this  data 
set,  try  to  write  down  the  above  research  questions  as  precisely  as  possible. 


Solutions  to  all  exercises  in  this  chapter  can  be  found  on  p.  321 


Frequency  Measures  and  Graphical 
Representation  of  Data 


In  Chap.  1 ,  we  highlighted  that  different  variables  contain  different  levels  of  informa¬ 
tion.  When  summarizing  or  visualizing  one  or  more  variable(s),  it  is  this  information 
which  determines  the  appropriate  statistical  methods  to  use. 

Suppose  we  are  interested  in  studying  the  employment  opportunities  and  starting 
salaries  of  university  graduates  with  a  master’s  degree.  Let  the  variable  X  denote  the 
starting  salaries  measured  in  €/year.  Now  suppose  100  graduate  students  provide 
their  initial  salaries.  Let  us  write  down  the  salary  of  the  first  student  as  x\,  the 
salary  of  the  second  student  as  X2,  and  so  on.  We  therefore  have  100  observations 
x\ ,  X2,  . . . ,  jcioo-  How  can  we  summarize  those  100  values  best  to  extract  meaningful 
information  from  them?  The  answer  to  this  question  depends  upon  several  aspects 
like  the  nature  of  the  recorded  data,  e.g.  how  many  observations  have  been  obtained 
(either  small  in  number  or  large  in  number)  or  how  the  data  was  recorded  (either 
exact  values  were  obtained  or  the  values  were  obtained  in  intervals).  For  example,  the 
starting  salaries  may  be  obtained  as  exact  values,  say  51,500  €/year,  32,350  €/year, 
etc.  Alternatively,  these  values  could  have  been  summarized  in  categories  such  as  low 
income  (<30,000  €/year),  medium  income  (30,000-50,000  €/year),  high  income 
(50,000-70,000  €/year),  and  very  high  income  (>70,000  €/year).  Another  approach 
is  to  ask  whether  the  students  were  employed  or  not  after  graduating  and  record  the 
data  in  terms  of  “yes”  or  “no”.  It  is  evident  that  the  latter  classification  is  less  detailed 
than  the  grouped  income  data  which  is  less  detailed  than  the  exact  data.  Depending  on 
which  conceptualization  of  “starting  salary”  we  use,  we  need  to  choose  the  approach 
to  summarize  the  data,  that  is  the  100  values  relating  to  the  100  graduated  students. 


2.1  Absolute  and  Relative  Frequencies 

Discrete  Data.  Let  us  first  consider  a  simple  example  to  illustrate  our  notation. 


©  Springer  International  Publishing  Switzerland  2016 
C.  Heumann  et  al.,  Introduction  to  Statistics  and  Data  Analysis , 
DOI  10.1007/978-3-319-46162-5  2 
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2  Frequency  Measures  and  Graphical  Representation  of  Data 


Example  2.1.1  Suppose  there  are  ten  people  in  a  supermarket  queue.  Each  of  them 
is  either  coded  as  “F”  (if  the  person  is  female)  or  “M”  (if  the  person  is  male).  The 
collected  data  may  look  like 

M,  F,  M,  F,  M,  M,  M,  F,  M,  M. 

There  are  now  two  categories  in  the  data:  male  (M)  and  female  (F).  We  use  a\  to  refer 
to  the  male  category  and  <22  to  refer  to  the  female  category.  Since  there  are  seven  male 
and  three  female  students,  we  have  7  values  in  category  denoted  as  n\  = 7 ,  and  3 
values  in  category  <22,  denoted  as  ri2  =  3.  The  number  of  observations  in  a  particular 
category  is  called  the  absolute  frequency.  It  follows  that  n  \  —  1  and  112  =  3  are  the 
absolute  frequencies  of  a\  and  <22,  respectively.  Note  that  n\  +  ri2  =  n  =  10,  which 
is  the  same  as  the  total  number  of  collected  observations.  We  can  also  calculate 
the  relative  frequencies  of  <21  and  <22  as  /)  =  f(a\)  =  ^  ^  =  0.7  =  70%  and 

f2  =  f(a2)  =  ^  ^  =  0.3  =  30  %,  respectively.  This  gives  us  information  about 

the  proportions  of  male  and  female  customers  in  the  queue. 

We  now  extend  these  concepts  to  a  general  framework  for  the  summary  of  data 
on  discrete  variables.  Suppose  there  are  k  categories  denoted  as  <21,  <22, . . . ,  <2& 
with  rij  (j  =  1,2,...,  k)  observations  in  category  aj .  The  absolute  frequency  nj  is 
defined  as  the  number  of  units  in  the  j  th  category  a  j .  The  sum  of  absolute  frequencies 

equals  the  total  number  of  units  in  the  data:  Xj=i  nj  —  n-  The  relative  frequencies 
of  the  j  th  class  are  defined  as 

fj  =  f (a j)  =  — ,  7  =  1,2,  (2.1) 

n 

The  relative  frequencies  always  lie  between  0  and  1  and  Xy=i  fj  —  1* 

Grouped  Continuous  Data.  Data  on  continuous  variables  usually  has  a  large  number 
0 k )  of  different  values.  Sometimes  k  may  even  be  the  same  as  n  and  in  such  a  case 
the  relative  frequencies  become  fj  =  ^  for  all  j .  However,  it  is  possible  to  define 
intervals  in  which  the  observed  values  are  contained. 

Example  2.1.2  Consider  the  following  n  =  20  results  of  the  written  part  of  a  driving 
licence  examination  (a  maximum  of  100  points  could  be  achieved): 

28,  35,  42,  90,  70,  56,  75,  66,  30,  89,  75,  64,  81,  69,  55,  83,  72,  68,  73,  16. 

We  can  summarize  the  results  in  class  intervals  such  as  0-20,  21-40,  41-60,  61-80, 
and  81-100,  and  the  data  can  be  presented  as  follows: 


Class  intervals 

0-20 

21-40 

41-60 

61-80 

81-100 

Absolute  frequencies 

n\  =  1 

in  —  3 

n3  =3 

114  =  9 

n5  =4 

Relative  frequencies 

II 

oh 

/2=lj 

f  =  35 

> 

II 

II 

We  have  nj  =20  =  n  and  fj  —  1- 


2.1  Absolute  and  Relative  Frequencies 
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Table  2.1  Frequency  distribution  for  discrete  data 


Class  intervals  ( aj ) 

a\ 

<12 

... 

ak 

Absolute  frequencies  (nj) 

n\ 

ri2 

... 

nk 

Relative  frequencies  (fj) 

h 

h 

... 

fk 

Now,  suppose  the  n  observations  can  be  classified  into  k  class  intervals 
a\ ,  ci2,  . . . ,  tffc,  where  aj(j  =  1,2,  ,  k)  contains  nj  observations  with  nj  = 

n.  The  relative  frequency  of  the  yth  class  is  fj  =  nj /n  and  Xj=i  fj  =  1-  Table  2.1 
displays  the  frequency  distribution  of  a  discrete  variable  X. 


Example  2.1.3  Consider  the  pizza  delivery  service  data  (Example  1.4.2,  Appen¬ 
dix  A. 4).  We  are  interested  in  the  pizza  deliveries  by  branch  and  generate  the  respec¬ 
tive  frequency  table,  showing  the  distribution  of  the  data,  using  the  table  command 
in  R  (after  reading  in  and  attaching  the  data)  as 


table (branch)  #  absolute  frequencies 

table (branch) /length (branch)  #  relative  frequencies 


dj 

Centre 

East 

West 

nJ 

421 

410 

435 

fi 

1266  *  °-333 

1266  *  °-323 

1266  *  0-344 

We  have  n  =  ^  •  rij  =  1266  deliveries  and  ^  •  fj  =  1.  We  can  see  from  this  table 
that  each  branch  has  a  similar  absolute  number  of  pizza  deliveries  and  each  branch 
contributes  to  about  one-third  of  the  total  number  of  deliveries. 


2.2  Empirical  Cumulative  Distribution  Function 


Another  approach  to  summarize  and  visualize  the  (frequency)  distribution  of  vari¬ 
ables  is  the  empirical  cumulative  distribution  function,  often  abbreviated  as 
“ECDF”.  As  the  name  itself  suggests,  it  gives  us  an  idea  about  the  cumulative  rela¬ 
tive  frequencies  up  to  a  certain  point.  For  example,  say  we  want  to  know  how  many 
people  scored  up  to  60  points  in  Example  2.1.2.  Then,  this  can  be  calculated  by 
adding  the  number  of  people  in  the  class  intervals  0-20,  21-40,  and  41-60,  which 
corresponds  to  n\  +  ri2  +  =  1  +  3  +  3  =  7  and  is  the  cumulative  frequency.  If 

we  want  to  know  the  relative  frequency  of  people  obtaining  up  to  60  points,  we  have 
to  add  the  relative  frequencies  of  the  people  in  the  class  intervals  0-20,  21-40,  and 

41-60  as  /i  +  /2  +  h  —  2b  +  h  +  h  = 
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2  Frequency  Measures  and  Graphical  Representation  of  Data 


Before  discussing  the  empirical  cumulative  distribution  function  in  a  more  general 
framework,  let  us  first  understand  the  concept  of  ordered  values.  Suppose  the  values 
of  height  of  four  people  are  observed  as  x\  =  180  cm,  X2  =  160  cm,  V3  =  175  cm, 
and  X4  =  170  cm.  We  arrange  these  values  in  an  order,  say  ascending  order,  i.e.  first 
the  smallest  value  (denoted  as  jc(iy)  and  lastly  the  largest  value  (denoted  as  X(4)).  We 
obtain 


V(i)  =  X2  =  160  cm,  X(2)  =  X4  =  170  cm, 

x@)  =  X3  =  175  cm,  X(4)  —  x\  =  180  cm. 

The  values  X(\) ,  X(2) ,  -^(3) ,  and  x^  are  called  ordered  values  for  which  X(\)  <  X(2)  < 

X(3)  <  X(4)  holds.  Note  that  x\  is  not  necessarily  the  smallest  value  but  jt(i)  is 

necessarily  the  smallest  value.  In  general,  if  we  have  n  observations  x\,  X2,  . . . ,  xn, 
then  the  ordered  data  is  X(\)  <  X(2)  <  •  •  •  <  X(n). 

Consider  n  observations  x\,  *2,  . . . ,  xn  of  a  variable  X,  which  are  arranged  in 
ascending  order  as  V(i)  <  X(2)  <  •  •  •  <  X(n)  (and  are  thus  on  an  at  least  ordinal  scale). 
The  empirical  cumulative  distribution  function  F(x)  is  defined  as  the  cumulative 
relative  frequencies  of  all  values  aj,  which  are  smaller  than,  or  equal  to,  v: 

F{x)  =  Y,  (2.2) 

Clj<X 

This  definition  implies  that  F(x)  is  a  monotonically  non-decreasing  function,  0  < 
F(x)  <  1,  lim^^-oo  F(x)  =  0  (the  lower  limit  of  F  is  0),  limx^+00  F(x)  =  1  (the 
upper  limit  of  F  is  1),  and  F(x)  is  right  continuous. 


2.2.1  ECDF  for  Ordinal  Variables 

The  empirical  cumulative  distribution  function  of  ordinal  variables  is  a  step  function. 

Example  2.2. 1  Consider  a  customer  satisfaction  survey  from  a  car  service  company. 
The  200  customers  who  had  a  car  service  done  within  the  last  30  days  were  asked  to 
respond  regarding  their  overall  level  of  satisfaction  with  the  quality  of  the  car  service 
on  a  scale  from  1  to  5  based  on  the  following  options:  1  =  not  satisfied  at  all,  2  = 
unsatisfied,  3  =  satisfied,  4  =  very  satisfied,  and  5  =  perfectly  satisfied.  Based  on 
the  frequency  of  each  option,  we  can  calculate  the  relative  frequencies  and  then 
plot  the  empirical  cumulative  distribution  function,  either  manually  (takes  longer) 
or  by  using  R  (quick): 


Satisfaction  level  (aj ) 

7  =  3 

j= 4 

j  =  5 

nJ 

4 

16 

90 

70 

20 

fj 

4/200 

16/200 

90/200 

70/200 

Fj 

4/200 

20/200 

110/200 

180/200 

200/200 

2.2  Empirical  Cumulative  Distribution  Function 
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Fig.  2.1  ECDF  for  the  satisfaction  survey 


The  Fj’s  are  calculated  as  follows: 

Fl  =  fl,  F3  =  /i+/2  +  /3, 

Fl  =  f\  +  fl,  F4  =  f\  +  f2  +  /3  +  /4. 

The  ECDF  for  this  data  can  be  obtained  by  summarizing  the  data  in  a  vector  and 
using  the  plot .  ecdf  ()  function  in  R ,  see  Fig.  2.1: 

sv  <-  c(rep(l,4) ,rep(2,16) ,rep(3,90) ,rep(4,70) ,rep(5,20)) 
plot . ecdf (sv) 


Satisfaction 


The  ECDF  can  be  used  to  obtain  the  relative  frequencies  for  values  contained  in 
certain  intervals  as 

H(c  <  x  <  d)  =  relative  frequency  of  values  x  with  c  <  x  <  d. 

It  further  follows  that: 


H(x 

< 

dj)  ■■ 

=  F(dj) 

(2.3) 

H(x 

< 

Clj)  ■■ 

=  H(x  <  aj)  -  f(aj)  =  F(dj)  -  f(dj) 

(2.4) 

H(x 

> 

ttj)  ■■ 

=  1  -  H(x 

<  oj)  = 

1  -  Fia,) 

(2.5) 

H(x 

> 

dj)  : 

=  1  -  H(X 

<  ai )  = 

1  -  F(dj)  +  f(aj) 

(2.6) 

H  (ah 

<  X 

< 

ah)  ■■ 

=  F(dj2)  - 

F(aj  1)  +  f(ah) 

(2.7) 

H  (ajl 

<  X 

< 

ah)  ■■ 

=  F{dh)  - 

F(dh) 

(2.8) 

H  {ah 

<  X 

< 

ah)  ■■ 

=  F(dj2)  - 

F(aj  1)  - 

-  f(aj2 ) 

(2.9) 

H  {ah 

<  X 

< 

dh)  : 

=  F(dh )  - 

F(dj  1)  - 

-  /(«;2)  + 

(2.10) 
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2  Frequency  Measures  and  Graphical  Representation  of  Data 


Example  2.2.2  Suppose,  in  Example  2.2.1,  we  want  to  know  how  many  customers 
are  not  satisfied  with  their  car  service.  Then,  using  the  data  relating  to  the  responses 
“1”  and  “2”,  we  observe  from  the  ECDF  that  (16  +  4)/200  %  =  10  %  of  the  cus¬ 
tomers  were  not  satisfied  with  the  car  service.  This  relates  to  using  rule  (2.3): 
H(X  <  2)  =  F( 2)  =  0.1  or  10  %.  Similarly,  the  proportion  of  customers  who  are 
more  than  satisfied  can  be  obtained  using  (2.5)  as  H(X  >  3)  =  1  —  H(x  <  3)  = 
1  -  1 10/200  =  0.45  %  or  45  %. 


2.2.2  ECDF  for  Continuous  Variables 


In  general,  we  can  apply  formulae  (2.2)-(2.10)  to  continuous  data  as  well.  However, 
before  demonstrating  their  use,  let  us  consider  a  somewhat  different  setting.  Let  us 
assume  that  a  continuous  variable  of  interest  is  only  available  in  the  form  of  grouped 
data.  We  may  assume  that  the  observations  within  each  group,  i.e.  each  category 
or  each  interval,  are  distributed  uniformly  over  the  entire  interval.  The  ECDF  then 
consists  of  straight  lines  connecting  the  lower  and  upper  values  of  the  ECDF  in  each 
of  the  intervals.  To  understand  this  concept  in  more  detail,  we  introduce  the  following 
notation: 

k  number  of  groups  (or  intervals), 

ej-i  lower  limit  of  yth  interval, 

ej  upper  limit  of  j th  interval, 

dj  =  ej  —  ej- 1  width  of  the  jth  interval, 
nj  number  of  observations  in  the  j  th  interval. 

Under  the  assumption  that  all  values  in  a  particular  interval  are  distributed  uni¬ 
formly  within  this  interval,  the  empirical  cumulative  distribution  function  relates  to  a 
polygonal  chain  connecting  the  points  (0,  0),  (e\,  F(e i)),  (<?2,  Ffa)),  . . . ,  (gg  1). 
The  ECDF  can  then  be  defined  as 


F(x)  = 


0, 

•  F(ej- 1)  + 

1, 


x  <  eo 

x  e  [ej-i,ej) 
x  >  ek 


(2.11) 


with  F(e o)  =  0.  The  idea  behind  (2.11)  is  presented  in  Fig.  2.2.  For  any  interval 
[e  j  _  i ,  ej),  the  respective  lower  and  upper  limits  of  the  ECDF  are  F  ( ej )  and  F  (e  j  _  i ) . 
If  we  assume  values  to  be  distributed  uniformly  over  this  interval,  we  can  connect 
F(ej)  and  F{ej- 1)  with  a  straight  line.  To  obtain  F(x)  with  v  >  ej- \  and  v  <  ej, 
we  simply  add  the  height  of  the  ECDF  between  F(ej-\)  and  F(x)  to  F(ej-i). 


Example  2.2.3  Consider  Example  2.1.3  of  the  pizza  delivery  service.  Suppose  we 
are  interested  in  determining  the  distribution  of  the  pizza  delivery  times.  Using 
the  function  plot .  ecdf  ()  in  R,  we  obtain  the  ECDF  of  the  continuous  data,  see 
Fig.  2.3a.  Note  that  the  structure  of  the  curve  is  a  step  function  but  now  almost  looks 
like  a  continuous  curve.  The  reason  for  this  is  that  when  the  number  of  observations  is 
large,  then  the  lengths  of  class  intervals  become  small.  When  these  small  lengths  are 
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Fig.  2.2  Illustration  of  the 
ECDF  for  continuous  data 
available  in  groups/intervals* 


joined  together,  they  appear  like  a  continuous  curve.  As  the  number  of  observations 
increases,  the  smoothness  of  the  curve  increases  too.  If  the  number  of  observations 
is  not  large,  e.g.  suppose  the  data  is  reported  as  a  summary  from  the  drivers,  i.e. 
whether  the  delivery  took  <15  min,  between  15  and  20  min,  between  20  and  25  min, 
and  so  on,  then  we  can  construct  the  ECDF  by  creating  a  table  summarizing  the  data 
features  as  in  Table  2.2. 

Figure  2.3b  shows  the  ECDF  based  on  the  grouped  data  evaluated  in  Table  2.2.  It 
is  interesting  to  see  that  the  graphs  emerging  from  the  use  of  the  grouped  data  and 
ungrouped  data  are  similar  in  this  specific  example. 

Suppose  we  are  interested  in  calculating  how  many  deliveries  were  completed 
within  the  desired  time  limit  of  30  min,  with  a  tolerance  of  maximum  10  % 
deviation,  i.e.  a  deviation  of  3  min.  We  can  evaluate  the  ECDF  at  v  =  33  min. 


Table  2.2  The  values  needed  to  calculate  the  ECDF  for  the  grouped  pizza  delivery  time  data  in 
Example  2.2.3 


Delivery  time 

j 

*7-1 

ej 

nj 

fj 

F(ej) 

[0;  10] 

1 

0 

10 

0 

0.0000 

0.0000 

(10;  15] 

2 

10 

15 

3 

0.0024 

0.0024 

(15;  20] 

3 

15 

20 

21 

0.0166 

0.0190 

(20;  25] 

4 

20 

25 

75 

0.0592 

0.0782 

(25;  30] 

5 

25 

30 

215 

0.1698 

0.2480 

(30;  35] 

6 

30 

35 

373 

0.2946 

0.5426 

(35;  40] 

7 

35 

40 

350 

0.2765 

0.8191 

(40;  45] 

8 

40 

45 

171 

0.1351 

0.9542 

(45;  50] 

9 

45 

50 

52 

0.0411 

0.9953 

(50;  55] 

10 

50 

55 

6 

0.0047 

1 .0000 
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(b)  Grouped  continuous  data 


Fig.  2.3  Empirical  cumulative  distribution  function  for  pizza  delivery  time 


Based  on  (2. 11),  we  calculate  H(X  <  33)  =  F( 33)  =  F(30)  +  /(6)/ 5(33  —  30)  = 
0.2480  +  0.2946/5  •  3  =  0.42476.  Thus,  we  conclude,  based  on  the  grouped  data, 
that  only  about  42  %  of  the  deliveries  were  completed  in  the  desired  time  frame. 


2.3  Graphical  Representation  of  a  Variable 

Frequency  tables  and  empirical  cumulative  distribution  functions  are  useful  in  provid¬ 
ing  a  numerical  summary  of  a  variable.  Graphs  are  an  alternative  way  to  summarize 
a  variable’s  information.  In  many  situations,  they  have  the  advantage  of  conveying 
the  information  hidden  in  the  data  more  compactly.  Similarly,  someone’s  mood  can 
be  more  easily  understood  when  looking  at  a  smiley  ©  than  by  reading  an  essay  about 
one’s  mood  in  a  long  paragraph. 


2.3.1  BarChart 

A  simple  tool  to  visualize  the  relative  or  absolute  frequencies  of  observed  values  of 
a  variable  is  a  bar  chart.  A  bar  chart  can  be  used  for  nominal  and  ordinal  variables, 
as  long  as  the  number  of  categories  is  not  very  large.  It  consists  of  one  bar  for  each 
category.  The  height  of  each  bar  is  determined  by  either  the  absolute  frequency  or 
the  relative  frequency  of  the  respective  category  and  is  shown  on  the  y-axis.  If  the 
variable  is  measured  on  an  ordinal  level,  then  it  is  recommended  to  arrange  the  bars 
on  the  v-axis  according  to  their  ranks  or  values.  If  the  number  of  categories  is  large, 
then  the  number  of  bars  will  be  large  too  and  the  bar  chart,  in  turn,  may  not  remain 
informative. 
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(a)  With  absolute  frequencies 


(b)  With  relative  frequencies 


Fig.  2.4  Bar  charts 


Example  2.3. 1  Consider  Example  2. 1 . 1  in  which  ten  people,  queueing  in  a  supermar¬ 
ket,  were  classified  as  being  either  male  (M)  or  female  (F).  The  absolute  frequencies 
for  males  and  females  are  n\  =7  and  ri2  =  3,  respectively.  Since  there  are  two  cate¬ 
gories,  M  and  F,  two  bars  are  needed  to  construct  the  chart — one  for  the  male  category 
and  another  for  the  female  category.  The  heights  of  the  bars  are  determined  as  either 
n\  =1  and  ri2  =  3  or  f\  =  0.7  and  /2  =  0.3.  These  graphs  are  shown  in  Fig.  2.4. 


Example  2.3.2  Consider  the  data  in  Example  2.F3,  where  the  pizza  delivery  times 
for  each  branch  are  recorded  over  a  period  of  1  month.  The  frequency  table  forms  the 
basis  for  the  bar  chart,  either  using  the  absolute  or  relative  frequencies  on  the  y-axis. 
Figure  2.5  shows  the  bar  charts  for  the  number  and  proportion  of  pizza  deliveries  per 
branch.  The  graphs  can  be  produced  in  R  by  applying  the  barplot  command  to  a 
frequency  table: 

barplot (table (branch) ) 

barplot (table (branch) /length (branch) ) 


Remark  2.3. 1  Instead  of  vertical  bars,  horizontal  bars  can  be  drawn  using  the  optional 
argument  horiz=TRUE  in  the  barplot  command. 
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O  LO 


Branch 


Branch 


(a)  With  absolute  frequencies 


(b)  With  relative  frequencies 


Fig.  2.5  Bar  charts  for  the  pizza  deliveries  per  branch 


2.3.2  Pie  Chart 

Pie  charts  are  another  option  to  visualize  the  absolute  and  relative  frequencies  of 
nominal  and  ordinal  variables.  A  pie  chart  is  a  circle  partitioned  into  segments, 
where  each  of  the  segments  represents  a  category.  The  size  of  each  segment  depends 
upon  the  relative  frequency  and  is  determined  by  the  angle  fj  •  360°. 

Example  2.3.3  To  illustrate  the  construction  of  a  pie  chart,  let  us  consider  again 
Example  2.1.1  in  which  ten  people  in  a  supermarket  queue  were  classified  as  being 
either  male  (M)  or  female  (F):  M,  F,  M,  F,  M,  M,  M,  F,  M,  M.  The  pie  chart  for  this 
data  will  have  two  segments:  one  for  males  and  another  one  for  females.  The  relative 
frequencies  are  f\  =7/10  and  fa  =  3/10,  respectively.  The  size  of  the  segment 
for  the  first  category  (M)  is  f\  •  360°  =  (7/10)  •  360°  =  252°,  and  the  size  of  the 
segment  for  the  second  category  (F)  is  /2  •  360°  =  (3/10)  •  360°  =  108°.  The  pie 
chart  is  shown  in  Fig.  2.6a. 

Example  2.3.4  Consider  again  Example  2.2.1,  where  200  customers  were  asked 
about  their  level  of  satisfaction  (5  categories)  with  their  car  service.  The  pie  chart 
for  this  example  consists  of  five  segments  representing  the  categories  1,  2,  3,  4, 
and  5.  The  size  of  the  yth  segment  is  fj  •  360°,  j  =  1,  2,  3,  4,  5.  For  example,  for 
category  1,  there  are  4  out  of  200  customers,  who  are  not  satisfied  at  all.  The  angle 
of  the  segment  “not  satisfied  at  all”  therefore  is  f\  •  360°  =  4/200  •  360°  =1.2°. 
Similarly,  we  can  calculate  the  angle  of  the  other  segments  and  obtain  a  pie  chart  as 
shown  in  Fig.  2.6b  using  the  pie  command  in  R 

pie (table (sv) ) 
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not  statisfied  at  all 


(a)  For  gender  of  people  queueing 


(b)  For  satisfaction  with  the  car  service 


Fig.  2.6  Pie  charts 


Remark  2.3.2  Note  that  the  area  of  a  segment  is  not  proportional  to  the  absolute 
frequency  of  the  respective  category.  Instead,  the  area  of  the  segment  is  proportional 
to  the  angle  fj  •  360°  (and  depends  also  on  the  radius  of  the  whole  circle).  It  has  been 
argued  that  this  may  cause  improper  interpretations  as  the  human  eye  may  catch  the 
segment’s  area  more  easily  than  the  angle  of  a  segment.  Pie  charts  should  therefore 
be  used  with  caution. 


2.3.3  Histogram 

If  a  variable  consists  of  a  large  number  of  different  values,  the  number  of  categories 
used  to  construct  bar  charts  will  consequently  be  large  too.  A  bar  chart  may  thus  not 
give  a  clear  summary  when  applied  to  a  continuous  variable.  Instead,  a  histogram  is 
the  appropriate  choice  to  represent  the  distribution  of  values  of  continuous  variables. 
It  is  based  on  the  idea  to  categorize  the  data  into  different  groups  and  plot  the  bars 
for  each  category  with  height  hj  =  fj /dj,  where  dj  =  ej  —  ej-\  denotes  the  width 
of  the  jth  class  interval  or  category.  An  important  consideration  for  this  concept  is 
that  the  area  of  the  bars  (=height  x  width)  is  proportional  to  the  relative  frequency. 
This  means  that  the  widths  of  the  bars  need  not  necessarily  to  be  the  same  because 
different  widths  can  be  adjusted  with  different  heights  of  the  bars. 

Example  2.3.5  Consider  Example  2. 1 .2,  where  n  =  20  people  were  divided  into  five 
class  intervals  0-20,  21-40,  41-60,  61-80,  and  81-100  based  on  their  performance 
in  a  written  driving  licence  examination.  The  frequency  table  is  given  as 
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Class  intervals 

0-20 

21-40 

41-60 

61-80 

81-100 

Absolute  freq 

n  i  =  1 

ri2  —  3 

n3  =3 

On 

II 

£ 

n5  =4 

Relative  freq 

> 

II 

oh 

h  =  k 

II 

h® 

II 

< 

II 

Height  fj/dj 

Sr- 

II 

o 

o 

ho  —  3 

n2  —  400 

^3  —  400 

0 

CT\  O 

Tf 

II 

^4" 

A5  —  400 

Fig.  2.7  Histogram  for  the 
scores  of  the  people 


LO 

CNJ 


Score 


The  histogram  for  this  grouped  data  set  has  five  categories  and  therefore  it  has 
five  bars.  Since  the  widths  of  class  intervals  are  the  same,  the  heights  of  the  bars 
are  proportional  to  the  relative  frequency  of  the  respective  category.  The  resulting 
histogram  is  displayed  in  Fig.  2.7. 


Example  2.3. 6  Recall  Example  2.2.3  and  the  variable  “pizza  delivery  time”.  Table  2.3 
shows  the  summary  of  the  grouped  data  and  the  values  needed  to  calculate  the  his¬ 
togram.  Figure  2.8a  shows  the  histogram  with  equal  widths  of  delivery  time  intervals. 
We  see  a  symmetric  distribution  of  the  pizza  delivery  times,  but  many  delivery  times 
exceeding  the  target  time  of  30  min.  If  the  histogram  is  required  to  have  different 
widths  for  different  bars,  i.e.  different  delivery  time  intervals  for  different  categories, 
then  it  can  also  be  constructed  as  shown  in  Fig.  2.8b.  This  representation  is  different 
from  Fig.  2.8a.  The  following  commands  in  R  are  used  to  construct  the  histograms 
for  absolute  and  relative  frequencies,  respectively: 


hist (time)  #  show  abs .  frequencies 

hist (time,  freq=F)  #  show  rel.  frequencies 


Remark  2.3.3  The  R  command  truehist  ( )  from  the  library  MASS  presents  an  alter¬ 
native  to  the  hist  ()  command.  The  default  specifications  are  somewhat  different, 
and  many  users  prefer  it  to  the  command  hist. 
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Table  2.3  Values  needed  to  calculate  the  histogram  for  the  grouped  pizza  delivery  time  data 


Delivery  time 

j 

ej- 1 

ej 

dj 

fi 

hj 

[0;  10] 

1 

0 

10 

10 

0.0000 

0.00000 

(10;  15] 

2 

10 

15 

5 

0.0024 

0.00047 

(15;  20] 

3 

15 

20 

5 

0.0166 

0.00332 

(20;  25] 

4 

20 

25 

5 

0.0592 

0.01185 

(25;  30] 

5 

25 

30 

5 

0.1698 

0.03397 

(30;  35] 

6 

30 

35 

5 

0.2946 

0.05893 

(35;  40] 

7 

35 

40 

5 

0.2765 

0.05529 

(40;  45] 

8 

40 

45 

5 

0.1351 

0.02701 

(45;  50] 

9 

45 

50 

5 

0.0411 

0.00821 

(50;  55] 

10 

50 

55 

5 

0.0047 

0.00094 

CD 
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10  20  30  40  50  60 

Delivery  Time  (in  min) 


With  same  width  for  each  category  (b)  With  different  widths  per  category 


Fig.  2.8  Histogram  for  pizza  delivery  time 


2.4  Kernel  Density  Plots 

A  disadvantage  of  histograms  is  that  continuous  data  is  categorized  artificially.  The 
choice  of  the  class  intervals  is  crucial  for  the  final  look  of  the  graph.  A  more  elegant 
way  to  deal  with  this  problem  is  to  smooth  the  histogram  in  the  sense  that  each  obser¬ 
vation  may  contribute  to  different  classes  with  different  weights,  and  the  distribution 
is  represented  by  a  continuous  function  rather  than  a  step  function.  A  kernel  density 
plot  can  be  produced  by  using  the  following  function: 
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(a)  Epanechnikow  kernel  density  estimate  (b)  Different  kernel  choices 

Fig.  2.9  Construction  of  kernel  density  plots 


fn  CO  = 


5^(4* ) ■ 

i= 1  v  7 


h  >  0, 


(2.12) 


where  n  is  the  sample  size,  h  is  the  bandwidth,  and  K  is  a  kernel  function,  for 
example 


K(x)  = 


K(x)  = 


\  if  -  1  <  x  <  1 
0  elsewhere 

|(1  —  x2)  if  \x\  <  1 
0  elsewhere. 


(rectangular  kernel) 
(Epanechnikov  kernel) 


To  better  understand  this  concept,  consider  Fig.  2.9a.  The  tick  marks  on  the  v-axis 
represent  five  observations:  3,  6,  7,  8,  and  10.  On  each  observation  Xi  as  well  as  its 
surrounding  values,  we  apply  a  kernel  function,  which  is  the  Epanechnikov  kernel  in 
the  figure.  This  means  that  we  have  five  functions  (grey,  dashed  lines),  which  refer  to 
the  five  observations.  These  functions  are  largest  at  the  observation  itself  and  become 
gradually  smaller  as  the  distance  from  the  observation  increases.  Summing  up  the 
functions,  as  described  in  Eq.  (2.12),  yields  the  solid  black  line,  which  is  the  kernel 
density  plot  of  the  five  observations.  It  is  a  smooth  curve,  which  represents  the  data 
distribution.  The  degree  of  smoothness  can  be  controlled  by  the  bandwidth  h,  which 
is  chosen  as  2  in  Fig.  2.9a. 
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(a)  Default  bandwidth 


Fig.  2.10  Kernel  density  plot  for  delivery  time 


The  choice  of  the  kernel  may  affect  the  overall  look  of  the  plot.  Above,  we  have 
given  the  functions  for  the  rectangular  and  Epanechnikov  kernels.  However,  another 
common  function  for  kernel  density  plots  is  the  normal  distribution  function,  which 
is  introduced  in  Sect.  8.2.2,  see  Fig.  2.9b  for  a  comparison  of  different  kernels.  The 
kernel  which  is  based  on  the  normal  distribution  is  called  the  ‘‘Gaussian  kernel”  and 
is  the  default  in  R ,  where  a  kernel  density  plot  can  be  produced  combining  the  plot 
and  density  commands: 

example  <-  c (3, 6, 7, 8, 10) 

plot (density (example ,  kernel= 1 gaussian ' ) ) 

Please  note  that  kernel  functions  are  not  defined  arbitrarily  and  need  to  satisfy  cer¬ 
tain  conditions,  such  as  those  required  for  probability  density  functions  as  explained 
in  Chap.  7,  Theorem  7.2.1. 

Example  2.4.1  Let  us  consider  the  pizza  data  which  we  introduced  earlier  and  in 
Appendix  A.4.  We  can  summarize  the  delivery  time  by  using  a  kernel  density  plot 
using  the  R  command  plot  (density  (time) )  and  compare  it  with  a  histogram, 
see  Fig.  2.10a.  We  see  that  the  delivery  times  are  symmetric  around  35  min.  If  we 
shorten  the  bandwidth  to  a  half  of  the  default  bandwidth  (option  adjust=0 . 5),  the 
kernel  density  plot  becomes  more  wiggly,  which  is  illustrated  in  Fig.  2.10b. 
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2.5  Key  Points  and  Further  Issues 


Note: 


/Bar  charts  and  histograms  are  not  the  same  graphical  tools.  Bar  charts 
visualize  the  categories  of  nominal  or  ordinal  variables  whereas  his¬ 
tograms  visualize  the  distribution  of  continuous  variables.  A  bar  chart 
does  not  require  to  have  ordered  values  on  the  v-axis,  but  a  histogram 
always  requires  the  values  on  the  v-axis  to  be  on  a  continuous  scale  and 
to  be  ordered.  The  interpretation  of  a  histogram  is  simplified  if  the  class 
intervals  are  equally  sized,  since  then  the  heights  of  the  rectangles  of 
the  histogram  are  proportional  to  the  absolute  or  relative  frequencies. 

/  The  ECDF  can  be  used  only  for  ordinal  and  continuous  variables,  see 
Sect.  7.2  for  the  theoretical  background  of  the  cumulative  distribution 
function. 

/  A  pie  chart  summarizes  observations  from  a  discrete  (nominal,  ordi¬ 
nal  or  grouped  continuous)  variable.  It  is  only  useful  if  the  number  of 
different  values  (categories)  is  small.  It  is  to  be  kept  in  mind  that  the 
area  of  each  segment  is  not  proportional  to  the  absolute  frequency  of 
the  respective  category.  The  angle  of  the  segment  is  proportional  to  the 
relative  frequency  of  the  respective  category. 

/  Other  possibilities  to  visualize  the  distribution  of  variables  are,  for  exam¬ 
ple,  box  plots  (Sect.  3.3)  and  stratified  plots  (Sects.  4.1.3, 4.3.1,  and  4.4). 


2.6  Exercises 


Exercise  2.1  Consider  the  results  of  the  national  elections  in  South  Africa  in  2014 
and  2009: 


Party 

Results  2014  (%)  Results  2009  (%) 

ANC 

(African  National  Congress) 

62.15 

65.90 

DA 

(Democratic  Alliance) 

22.23 

16.66 

EFF 

(Economic  Freedom  Fighters)  6.35 

— 

IFP 

(Inkatha  Freedom  Party) 

2.40 

4.55 

COPE  (Congress  of  the  People) 

0.67 

7.42 

Others 

6.20 

5.47 

(a)  Summarize  the  results  of  the  2014  elections  in  a  bar  chart.  Do  it  manually  and 
by  using  R. 

(b)  How  would  you  compare  the  results  of  the  2009  and  2014  elections?  Offer  a 
simple  solution  that  can  be  represented  in  a  single  plot.  Construct  this  plot  in  R. 
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Exercise  2.2  Consider  a  variable  X  describing  the  time  until  the  first  goal  was  scored 
in  the  matches  of  the  2006  football  World  Cup  competition.  Only  matches  with  at 
least  one  goal  are  considered,  and  goals  during  the  xth  minute  of  extra  time  are 
denoted  as  90  +  v : 


6 

24 

90+1 

8 

4 

25 

3 

83 

89 

34 

25 

24 

18 

6 

23 

10 

28 

4 

63 

6 

60 

5 

40 

2 

22 

26 

23 

26 

44 

49 

34 

2 

33 

9 

16 

55 

23 

13 

23 

4 

8 

26 

70 

4 

6 

60 

23 

90+5 

28 

49 

6 

57 

33 

56 

7 

(a)  What  is  the  scale  of  X  ? 

(b)  Write  down  the  frequency  table  of  X  based  on  the  following  categories:  [0,  15), 
[15,  30),  [30,  45),  [45,  60),  [60,  75),  [75,  90),  [90,  96). 

(c)  Draw  the  histogram  for  X  with  intervals  relating  to  the  groups  from  the  frequency 
table. 

(d)  Now  use  R  to  reproduce  the  histogram.  Compare  the  histogram  to  a  kernel 
density  plot  of  your  choice. 

(e)  Calculate  the  empirical  cumulative  distribution  function  for  the  grouped  data. 

(f)  Use  R  to  plot  the  ECDF  (via  a  step  function)  for 

(i)  the  original  data  and 

(ii)  the  grouped  data. 

(g)  Consider  the  grouped  data.  Now  assume  that  the  values  within  each  interval  are 
distributed  uniformly.  Determine  the  proportion  of  first  goals  which  occurred 

(i)  in  the  first  half,  i.e.  during  the  first  45  min, 

(ii)  in  the  last  10  min  or  during  the  extra  time, 

(iii)  between  the  20th  and  65th  min,  i.e.  what  is  7/(20  <  X  <  65)7 

(h)  Determine  the  time  point  at  which  in  80  %  of  the  matches  the  first  goal  was 
scored  at  or  before  this  time  point. 

Exercise  2.3  Suppose  we  have  the  following  information  to  construct  a  histogram 
for  a  continuous  variable  with  2000  observations: 


j 

ei- 1 

ej 

dj 

hj 

1 

0 

1 

1 

0.125 

2 

1 

4 

3 

0.125 

3 

4 

7 

3 

0.125 

4 

7 

8 

1 

0.125 

(a)  Determine  the  relative  frequencies  for  each  interval. 

(b)  Determine  the  absolute  frequencies. 


34 


2  Frequency  Measures  and  Graphical  Representation  of  Data 


Exercise  2.4  A  university  survey  was  conducted  on  500  first-year  students  to  obtain 
knowledge  about  the  size  of  their  accommodation  (in  square  metres). 


j 

7 

Size  of  accommodation  (m  ) 

ei- 1  <x<ej 

Fix) 

1 

8-14 

0.25 

2 

14-22 

0.40 

3 

22-34 

0.75 

4 

34-50 

0.97 

5 

50-82 

1.00 

(a)  Determine  the  absolute  frequencies  for  each  category. 

(b)  What  proportion  of  people  live  in  a  flat  of  at  least  34  m2? 

Exercise  2.5  Consider  a  survey  in  which  100  people  were  asked  to  rate  on  a  scale 
from  1  to  10  how  much  they  agree  with  the  statement  that  “there  is  too  much  football 
on  television”.  The  results  are  summarized  below: 


Score 

012345  6  7  89  10 

Responses 

0  1  3  8  8  27  30  11  6  4  2 

(a)  Calculate  and  draw  the  ECDF  of  the  scores. 

(b)  Determine  F( 3)  and  F( 9). 

(c)  Consider  the  situation,  where  the  data  is  summarized  in  the  two  categories  “dis¬ 
agree”  (score  <  5)  and  “agree”  (score  >  5).  What  would  the  ECDF  look  like 
under  the  approach  outlined  in  (2.11)?  Determine  F( 3)  and  F( 9)  for  the  sum¬ 
marized  data. 

(d)  Explain  the  differences  between  (b)  and  (c). 

Exercise  2.6  It  is  possible  to  produce  professional  graphics  in  R.  However,  it  is 

advantageous  to  go  beyond  the  default  options.  To  demonstrate  this,  consider  Exam¬ 
ple  2.1.3  about  the  pizza  delivery  data,  which  is  described  in  Appendix  A.4. 

(a)  Set  the  working  directory  in  R  (setwd  ( ) ),  read  in  the  data  (read .  csv  0 ),  and 
attach  the  data.  Draw  a  histogram  of  the  variable  “temperature”.  Type  ?hist, 
and  view  the  options.  Adjust  the  histogram  so  that  you  are  satisfied  with  (i)  axes 
labelling,  (ii)  axes  range,  and  (iii)  colour.  Now  use  the  lines  0  command  to 
add  a  dashed  vertical  line  at  65  °C  (which  is  the  minimum  temperature  the  pizza 
should  have  at  the  time  of  delivery). 

(b)  Consider  a  different  approach,  which  constructs  plots  by  means  of  multiple  lay¬ 
ers  using  ggplot2.  You  need  an  Internet  connection  to  install  the  package  using 
the  command  install  .packages  ( ’ggplot2D.  Browse  through  the  help 


2.6  Exercises 


35 


pages  on  http://docs.ggplot2.org/current/.  Look  specifically  at  the  examples  for 
ggplot,  qplot,  scale_histogram,  and  scale_y_continuous.  Try  to 
understand  the  roles  of  “aesthetics”  and  “geoms”.  Now,  after  loading  the  library 
via  library  (ggplot  2) ,  create  a  ggplot  object  for  the  pizza  data,  which  declares 
“temperature”  to  be  the  v -variable.  Now  add  a  layer  with  geom_histogram  to 
create  a  histogram  with  interval  width  of  2.5  and  dark  grey  bars  which  are 
50  %  transparent.  Change  the  y-axis  labelling  by  adding  the  relevant  layer  using 
scale_y_continuous.  Plot  the  graph. 

(c)  Now  create  a  normal  bar  chart  for  the  variable  “driver”  in  R.  Type  Tbarplot 
and  ?par  to  see  the  options  one  can  pass  on  to  barchart  ( )  to  adjust  the  graph. 
Make  the  graph  look  good. 

(d)  Now  create  the  same  bar  chart  with  ggplot2.  Use  qplot  instead  of  ggplot 
to  create  the  plot.  Use  an  option  which  makes  each  bar  to  consist  of  segments 
relating  to  the  day  of  delivery,  so  that  one  can  see  the  number  of  deliveries  by 
driver  to  highlight  during  which  days  the  drivers  delivered  most  often.  Browse 
through  “themes”  and  “scales”  on  the  help  page,  and  add  layers  that  make  the 
background  black  and  white  and  the  bars  on  a  grey  scale. 


Solutions  to  all  exercises  in  this  chapter  can  be  found  on  p.  325 


*  Source  Toutenburg,  H.,  Heumann,  C.,  Deskriptive  Statistik ,  7th  edition,  2009, 
Springer,  Heidelberg 


Measures  of  Central  Tendency  and 
Dispersion 


A  data  set  may  contain  many  variables  and  observations.  However,  we  are  not  always 
interested  in  each  of  the  measured  values  but  rather  in  a  summary  which  interprets  the 
data.  Statistical  functions  fulfil  the  purpose  of  summarizing  the  data  in  a  meaningful 
yet  concise  way. 

Example  3.0.1  Suppose  someone  from  Munich  (Germany)  plans  a  holiday  in 
Bangkok  (Thailand)  during  the  month  of  December  and  would  like  to  get  infor¬ 
mation  about  the  weather  when  preparing  for  the  trip.  Suppose  last  year’s  maximum 
temperatures  during  the  day  (in  degrees  Celsius)  for  December  1-31  are  as  follows: 

22,  24,  21,  22,  25,  26,  25,  24,  23,  25,  25,  26,  27,  25,  26, 

25,  26,  27,  27,  28,  29,  29,  29,  28,  30,  29,  30,  31,  30,  28,  29. 

How  do  we  draw  conclusions  from  this  data?  Looking  at  the  individual  values  gives 
us  a  feeling  about  the  temperatures  one  can  experience  in  Bangkok,  but  it  does  not 
provide  us  with  a  clear  summary.  It  is  evident  that  the  average  of  these  3 1  values  as 
“Sum  of  all  values/Total  number  of  observations”  (22  +  24  +  •  •  •  +  28  +  29)/31  = 
26.48  is  meaningful  in  the  sense  that  we  know  what  temperature  to  expect  “on 
average”.  To  choose  the  right  clothing  for  the  holidays,  we  may  also  be  interested  in 
knowing  the  temperature  range  to  understand  the  variability  in  temperature,  which 
is  between  21  and  3 1  °C.  Summarizing  31  individual  values  with  only  three  numbers 
(26.48,  21,  and  31)  will  provide  sufficient  information  to  plan  the  holidays. 

In  this  chapter,  we  focus  on  the  most  important  statistical  concepts  to  summarize 
data:  these  are  measures  of  central  tendency  and  variability.  The  applications  of  each 
measure  depend  on  the  scale  of  the  variable  of  interest,  see  Appendix  D.l  for  a 
detailed  summary. 


©  Springer  International  Publishing  Switzerland  2016 
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3  Measures  of  Central  Tendency  and  Dispersion 


3.1  Measures  of  Central  Tendency 

A  natural  human  tendency  is  to  make  comparisons  with  the  “average”.  For  example, 
a  student  scoring  40  %  in  an  examination  will  be  happy  with  the  result  if  the  average 
score  of  the  class  is  25  %.  If  the  average  class  score  is  90  %,  then  the  student  may  not 
feel  happy  even  if  he  got  70  %  right.  Some  other  examples  of  the  use  of  “average” 
values  in  common  life  are  mean  body  height,  mean  temperature  in  July  in  some 
town,  the  most  often  selected  study  subject,  the  most  popular  TV  show  in  2015,  and 
average  income.  Various  statistical  concepts  refer  to  the  “average”  of  the  data,  but 
the  right  choice  depends  upon  the  nature  and  scale  of  the  data  as  well  as  the  objective 
of  the  study.  We  call  statistical  functions  which  describe  the  average  or  centre  of  the 
data  location  parameters  or  measures  of  central  tendency. 


3.1 .1  Arithmetic  Mean 


The  arithmetic  mean  is  one  of  the  most  intuitive  measures  of  central  tendency. 
Suppose  a  variable  of  size  n  consists  of  the  values  x\,  X2,  ...  ,xn.  The  arithmetic 
mean  of  this  data  is  defined  as 


v  = 


(3.1) 


In  informal  language,  we  often  speak  of  “the  average”  or  just  “the  mean”  when  using 
the  formula  (3.1). 

To  calculate  the  arithmetic  mean  for  grouped  data,  we  need  the  following  fre¬ 
quency  table: 


Class  intervals  aj 

a\  —  eo  —  e\ 

CN 

II 

(N 

Q 

. . . 

0>k  —  ek- 1  -  Ck 

Absolute  freq.  nj 

n\ 

ri2 

.  .  . 

nk 

Relative  freq.  fj 

h 

h 

. . . 

fk 

Note  that  ai,  <22,  . . . ,  ak  are  the  k  class  intervals  and  each  interval  aj  (j  =  1,2, ...  ,k) 
contains  nj  observations  with  X^=i  nj  =  n-  The  relative  frequency  of  the  j th  class 

is  fj  =  rij/n  and  Xy=i  fj  =  1*  The  mid- value  of  the  j  th  class  interval  is  defined  as 
rnj  =  (ej- 1  +  e  j  )/2,  which  is  the  mean  of  the  lower  and  upper  limits  of  the  interval. 
The  weighted  arithmetic  mean  for  grouped  data  is  defined  as 

1  k  k 

=  2Vm/' 

j= 1  j= 1 


v  = 


(3.2) 
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Example  3.1.1  Consider  again  Example  3.0.1  where  we  looked  at  the  temperature 
in  Bangkok  during  December.  The  measurements  were 


22,  24,  21,  22,  25,  26,  25,  24,  23,  25,  25,  26,  27,  25,  26, 

25,  26,  27,  27,  28,  29,  29,  29,  28,  30,  29,  30,  31,  30,  28,  29  . 

The  arithmetic  mean  is  therefore 

22  +  24  +  21  +  •  •  •  +  28  +  29 

x  =  — - - - - - - - - - —  =  26.48  °C. 

31 

In  R ,  the  arithmetic  mean  can  be  calculated  using  the  mean  command: 


weather  <-  c(22, 24, 21 , ,30,28,29) 
mean (weather) 

[1]  26.48387 


Let  us  assume  the  data  in  Example  3.0.1  is  summarized  in  categories  as 
follows: 


Class  intervals 

<  20 

(20  -  25] 

(25,  30] 

(30,  35] 

>  35 

Absolute  frequencies 

n\  =  0 

in  =  12 

/2  3  =  18 

«4  =  1 

n5  =0 

Relative  frequencies 

/i  =0 

2  is 

II 

r  18 

73  —  31 

HS 

II 

< 

o 

II 

We  can  calculate  the  (weighted)  arithmetic  mean  as 

E 

K  ^  12  18  1 

x  =  ^  f;m  j  =  0  4 - -  22.5  H - -  27.5  H - 32.5  +  0  ~  25.7. 

^  J  J  31  31  31 

7  =  1 

In  R,  we  use  the  weighted .  mean  function  to  obtain  the  result.  The  function  requires 
to  specify  the  (hypothesized)  means  for  each  group,  for  example  the  middle  values 
of  the  class  intervals,  as  well  as  the  weights. 


weighted. mean(c (22. 5, 27. 5, 32. 5) ,c (12/31, 18/31, 1/31)) 


Interestingly,  the  results  of  the  mean  and  the  weighted  mean  differ.  This  is  because 
we  use  the  middle  of  each  class  as  an  approximation  of  the  mean  within  the  class.  The 
implication  is  that  we  assume  that  the  values  are  uniformly  distributed  within  each 
interval.  This  assumption  is  obviously  not  met.  If  we  had  knowledge  about  the  mean 
in  each  class,  like  in  this  example,  we  would  obtain  the  correct  result  as  follows: 

12  18  1 

x  =  >  fjXj  =  0  +  —  •  23.83333  +  —  •  28  +  —32.5  +  0  =  26.48387. 

^ JJ  1  31  31  31 

7  =  1 

However,  the  weighted  mean  is  meant  to  estimate  the  arithmetic  mean  in  those 
situations  where  only  grouped  data  is  available.  It  is  therefore  typically  used  to 
obtain  an  approximation  of  the  true  mean. 
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Properties  of  the  Arithmetic  Mean. 


(i)  The  sum  of  the  deviations  of  each  variable  around  the  arithmetic  mean  is  zero: 

n  n 

^y^(xj  —  x)  =  ^^Xi  —  nx  =  nx  —  nx  =  0.  (3.3) 

i  =  \  i  =  1 

(ii)  If  the  data  is  linearly  transformed  as  yt  =  a  +  bxi ,  where  a  and  b  are  known 
constants,  it  holds  that 


1 

n 


n 

y  (a + bxj) 

i  —  \ 


1  n  b  U 

—  >  a  H —  7  Xi  =  a  +  bx. 

n  n 

i  =  l  i= 1 


(3.4) 


Example  3.1.2  Recall  Examples  3.0.1  and  3.1.1  where  we  considered  the  tempera¬ 
tures  in  December  in  Bangkok.  We  measured  them  in  degrees  Celsius,  but  someone 
from  the  USA  might  prefer  to  know  them  in  degrees  Fahrenheit.  With  a  linear  trans¬ 
formation,  we  can  create  a  new  temperature  variable  as 

Temperature  in°F  =  32  +  1.8  Temperature  in  °C. 

Using  y  =  a  +  bx ,  we  get  y  =  32  +  1.8  •  26.48  ~  79.7  °F. 


3.1 .2  Median  and  Quantiles 

The  median  is  the  value  which  divides  the  observations  into  two  equal  parts  such 
that  at  least  50  %  of  the  values  are  greater  than  or  equal  to  the  median  and  at  least 
50  %  of  the  values  are  less  than  or  equal  to  the  median.  The  median  is  denoted  by 
vo. 5 ;  then,  in  terms  of  the  empirical  cumulative  distribution  function,  the  condition 
F(x 0.5)  =  0.5  is  satisfied.  Consider  the  n  observations  x\,  X2,  . . . ,  xn  which  can 
be  ordered  as  V(i)  <  +2)  <  •  •  •  <  *(w).  The  calculation  of  the  median  depends  on 
whether  the  number  of  observations  n  is  odd  or  even.  When  n  is  odd,  then  V0.5  is  the 
middle  ordered  value.  When  n  is  even,  then  V0.5  is  the  arithmetic  mean  of  the  two 
middle  ordered  values: 

*((n+ 1)/2)  if  n  is  odd 

*05  =  1  1  /  ,  .  .  (3.5) 

.  2  (x(n/2)  +  -++2+1))  if  a  is  even. 

Example  3.1.3  Consider  again  Examples  3. 0.1 -3. 1.2  where  we  evaluated  the  tem¬ 
perature  in  Bangkok  in  December.  The  ordered  values  +/),  i  =  1,  2,  . . . ,  31,  are  as 
follows: 


°c 

21 

22 

22 

23 

24 

24 

25 

25 

25 

25 

25 

25 

26 

26 

26 

26 

(i) 

1 

2 

3 

4 

5 

6 

7 

8 

9 

10 

11 

12 

13 

14 

15 

16 

°c 

27 

27 

27 

28 

28 

28 

29 

29 

29 

29 

29 

30 

30 

30 

31 

(i) 

17 

18 

19 

20 

21 

22 

23 

24 

25 

26 

27 

28 

29 

30 

31 

3.1  Measures  of  Central  Tendency 


41 


We  have  n  =  31,  and  therefore  vo.5  =  =  *((3i+i)/2)  —  *(16)  =  26.  There¬ 

fore,  at  least  50  %  of  the  31  observations  are  greater  than  or  equal  to  26  and  at  least 
50  %  are  less  than  or  equal  to  26.  If  one  value  was  missing,  let  us  say  the  last  observa¬ 
tion,  then  the  median  would  be  calculated  as  ^  (*(3o/2)  +  *(30/2+ 1))  =  \  (26  +  26)  = 
26.  In  R,  we  would  have  obtained  the  results  using  the  median  command: 

median (weather ) 

If  we  deal  with  grouped  data,  we  can  calculate  the  median  under  the  assumption 
that  the  values  within  each  class  are  equally  distributed.  Let  K\,  K2, . . . ,  be 
k  classes  with  observations  of  size  n\,  ri2,  . . . ,  rik,  respectively.  First,  we  need  to 
determine  which  class  is  the  median  class,  i.e.  the  class  that  includes  the  median.  We 
define  the  median  class  as  the  class  Km  for  which 

m  —  1  m 

Z  fj  <  °-5  and  Z  fj  -  °-5  (3-6) 

j= 1  j= 1 

hold.  Then,  we  can  determine  the  median  as 

0.5  -  XTJ,1  fi 

+.5  =  em - 1  H - - - dm  (3.7) 

Jm 

where  em-\  denotes  the  lower  limit  of  the  interval  Km  and  dm  is  the  width  of  the 
interval  Km . 

Example  3.1.4  Recall  Example  3.1.1  where  we  looked  at  the  grouped  temperature 
data: 


Class  intervals 

<20 

(20-25] 

(25,  30] 

(30,  35] 

>35 

nj 

n\  =  0 

112  =  12 

+3  =  18 

^4  =  1 

ns  =  0 

fi 

/1  =  0 

f  12 

72  —  31 

/•  18 
73  —  31 

> 

II 

0 

II 

Hj  fj 

0 

XT' 

31 

30 

31 

1 

1 

For  the  third  class  (m  =  3),  we  have 

m  —  1 


Z  fi  = 

7  —  1 


12 

3l 


<  0.5  and 


We  can  therefore  calculate  the  median  as 

<m  — 1 


*0.5  —  em-\  + 


o.5  -  Z7=7  fi 


m 


Z  //  = 


7  =  1 


30 

31 


>  0.5. 


fm 


dm  — 


0.5  -  £ 

25  4 _ 21 

zj  -h 

31 


5  «  25.97. 
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3  Measures  of  Central  Tendency  and  Dispersion 


(a)  Symmetric  data 


(b)  Skewed  data 


(c)  Bimodal  data 


(d)  Data  with  outliers 


Fig.  3.1  Arithmetic  mean  and  median  for  different  data 


Comparing  the  Mean  with  the  Median.  In  the  above  examples,  the  mean  and  the 
median  turn  out  to  be  quite  similar  to  each  other.  This  is  because  we  looked  at  data 
which  is  symmetrically  distributed  around  its  centre,  i.e.  on  average,  we  can  expect 
26  °C  with  deviations  that  are  similar  above  and  below  the  average  temperature. 
A  similar  example  is  given  in  Fig.  3.1a:  we  see  that  the  raw  data  is  summarized  by 
using  ticks  at  the  bottom  of  the  graph  and  by  using  a  kernel  density  estimator.  The 
mean  and  the  median  are  similar  here  because  the  distribution  of  the  observations 
is  symmetric  around  the  centre.  If  we  have  skewed  data  (Fig.  3.1b),  then  the  mean 
and  the  median  may  differ.  If  the  data  has  more  than  one  centre,  such  as  in  Fig.  3.1c, 
neither  the  median  nor  the  mean  has  meaningful  interpretations.  If  we  have  outliers 
(Fig.  3.  Id),  then  it  is  wise  to  use  the  median  because  the  mean  is  sensitive  to  outliers. 
These  examples  show  that  depending  on  the  situation  of  interest  either  the  mean,  the 
median,  both  or  neither  of  them  can  be  useful. 


3.1  Measures  of  Central  Tendency 
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Quantiles.  Quantiles  are  a  generalization  of  the  idea  of  the  median.  The  median  is 
the  value  which  splits  the  data  into  two  equal  parts.  Similarly,  a  quantile  partitions 
the  data  into  other  proportions.  For  example,  a  25  %-quantile  splits  the  data  into  two 
parts  such  that  at  least  25  %  of  the  values  are  less  than  or  equal  to  the  quantile  and  at 
least  75  %  of  the  values  are  greater  than  or  equal  to  the  quantile.  In  general,  let  a  be  a 
number  between  zero  and  one.  The  ( a  x  100)  %-quantile,  denoted  as  xa,  is  defined  as 
the  value  which  divides  the  data  in  proportions  of  ( a  x  100)%  and  (1  —  a)  x  100% 
such  that  at  least  a  x  100  %  of  the  values  are  less  than  or  equal  to  the  quantile  and 
at  least  (1  —  a)  x  100%  of  the  values  are  greater  than  or  equal  to  the  quantile.  In 
terms  of  the  empirical  cumulative  distribution  function,  we  can  write  F(xa)  =  a.  It 
follows  immediately  that  for  n  observations,  at  least  na  values  are  less  than  or  equal 
to  xa  and  at  least  n(  1  —  a)  observations  are  greater  than  or  equal  to  xa.  The  median 
is  the  50  %-quantile  *0.5.  If  a  takes  the  values  0.1,  0.2,  . . . ,  0.9,  the  quantiles  are 
called  deciles.  If  a  •  100  is  an  integer  number  (e.g.  a  x  100  =  95),  the  quantiles  are 
called  percentiles,  i.e.  the  data  is  divided  into  100  equal  parts.  If  a  takes  the  values 
0.2,  0.4,  0.6,  and  0.8,  the  quantiles  are  known  as  quintiles  and  they  divide  the  data 
into  five  equal  parts.  If  a  takes  the  values  0.25,  0.5,  and  0.75,  the  quantiles  are  called 
quartiles. 

Considers  ordered  observations  V(i)  <  x@)  <  •  •  •  <  v^.Theo'  •  100  %-quantile 
xa  is  calculated  as 


a 


X(k)  if  na  is  not  an  integer  number, 

choose  k  as  the  smallest  integer  >  na , 
\  (x(na)  +  X(„a+ 1))  if  not  is  an  integer. 


(3.8) 


Example  3.1.5  Recall  Examples  3.0. 1-3. 1.4  where  we  evaluated  the  temperature  in 
Bangkok  in  December.  The  ordered  values  *(*•),  i  =  1,  2, . . . ,  31  are  as  follows: 


°c 

21 

22 

22 

23 

24 

24 

25 

25 

25 

25 

25 

25 

26 

26 

26 

26 

(i) 

1 

2 

3 

4 

5 

6 

7 

8 

9 

10 

11 

12 

13 

14 

15 

16 

°c 

27 

27 

27 

28 

28 

28 

29 

29 

29 

29 

29 

30 

30 

30 

31 

(i) 

17 

18 

19 

20 

21 

22 

23 

24 

25 

26 

27 

28 

29 

30 

31 

To  determine  the  quartiles,  i.e.  the  25,  50,  and  75  %  quantiles,  we  calculate  na  as 
31  -0.25  =  7.75,31  -0.5  =  15.5,  and  31  -0.75  =  23.25.  Using  (3.8),  it  follows  that 

*0.25  =  *(8)  =  25,  Vo. 5  =  *(16)  =  26, 

*0.75  =  *(24)  =  29. 

In  R ,  we  obtain  the  same  results  using  the  quantile  function.  The  probs  argument 
is  used  to  specify  a.  By  default,  the  quartiles  are  reported. 

quantile (weather)  O 

quantile (weather ,  probs=c(0,0.25,0.5,0.75, 1)) 

However,  please  note  that  R  offers  nine  different  ways  to  obtain  quantiles,  each 
of  which  can  be  chosen  by  the  type  argument.  See  Hyndman  and  Fan  (1996)  for 
more  details. 


44 


3  Measures  of  Central  Tendency  and  Dispersion 


(a)  QQ-plot  for  Luigi  and  Domenico  (b)  QQ-plot  for  Mario  and  Salvatore 
Fig.  3.2  QQ-plots  for  the  pizza  delivery  time  for  different  drivers 

3.1 .3  Quantile-Quantile  Plots  (QQ-Plots) 

If  we  plot  the  quantiles  of  two  variables  against  each  other,  we  obtain  a  Quantile- 
Quantile  plot  (QQ-plot).  This  provides  a  simple  summary  of  whether  the  distributions 
of  the  two  variables  are  similar  with  respect  to  their  location  or  not. 

Example  3.1.6  Consider  again  the  pizza  data  which  is  described  in  Appendix  A.4.  We 
may  be  interested  in  the  delivery  time  for  different  drivers  to  see  if  their  performance 
is  the  same.  Figure 3.2a  shows  a  QQ-plot  for  the  delivery  time  of  driver  Luigi  and 
the  delivery  time  of  driver  Domenico.  Each  point  refers  to  the  a%  quantile  of  both 
drivers.  If  the  point  lies  on  the  bisection  line,  then  they  are  identical  and  we  conclude 
that  the  quantiles  of  the  both  drivers  are  the  same.  If  the  point  is  below  the  line,  then 
the  quantile  is  higher  for  Luigi,  and  if  the  point  is  above  the  line,  then  the  quantile  is 
lower  for  Luigi.  So  if  all  the  points  lie  exactly  on  the  line,  we  can  conclude  that  the 
distributions  of  both  the  drivers  are  the  same.  We  see  that  all  the  reported  quantiles 
lie  below  the  line,  which  implies  that  all  the  quantiles  of  Luigi  have  higher  values 
than  those  of  Domenico.  This  means  that  not  only  on  an  average,  but  also  in  general, 
the  delivery  times  are  higher  for  Luigi.  If  we  look  at  two  other  drivers,  as  displayed 
in  Fig.  3.2b,  the  points  lie  very  much  on  the  bisection  line.  We  can  therefore  conclude 
that  the  delivery  times  of  these  two  drivers  do  not  differ  much. 

In  R,  we  can  generate  QQ-plots  by  using  the  qqplot  command: 

qqplotO 
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(a)  Setting  1 


(b)  Setting  2 


Fig.  3.3  Different  patterns  for  a  QQ-plot 


As  a  summary,  let  us  consider  four  important  patterns: 


(a)  If  all  the  pairs  of  quantiles  lie  (nearly)  on  a  straight  line  at  an  angle  of  45  %  from 
the  v-axis,  then  the  two  samples  have  similar  distributions  (Fig.  3.3a). 

(b)  If  the  y-quantiles  are  lower  than  the  v -quantiles,  then  the  y-values  have  a  ten¬ 
dency  to  be  lower  than  the  v -values  (Fig.  3.3b). 

(c)  If  the  v -quantiles  are  lower  than  the  y-quantiles,  then  the  v -values  have  a  ten¬ 
dency  to  be  lower  than  the  y-values  (Fig.  3.3c). 

(d)  If  the  QQ-plot  is  like  Fig.  3.3d,  it  indicates  that  there  is  a  break  point  up  to  which 
the  y-quantiles  are  lower  than  the  v -quantiles  and  after  that  point,  the  y-quantiles 
are  higher  than  the  v -quantiles. 


3.1.4  Mode 

Consider  a  situation  in  which  an  ice  cream  shop  owner  wants  to  know  which  flavour 
of  ice  cream  is  the  most  popular  among  his  customers.  Similarly,  a  footwear  shop 
owner  may  like  to  find  out  what  design  and  size  of  shoes  are  in  highest  demand.  To 
answer  this  type  of  questions,  one  can  use  the  mode  which  is  another  measure  of 
central  tendency. 

The  mode  xm  of  n  observations  x\,  X2,  . . . ,  xn  is  the  value  which  occurs  the 
most  compared  with  all  other  values,  i.e.  the  value  which  has  maximum  absolute 
frequency.  It  may  happen  that  two  or  more  values  occur  with  the  same  frequency  in 
which  case  the  mode  is  not  uniquely  defined.  A  formal  definition  of  the  mode  is 

xm  =  aj  rij  =  max  {n\,  ri2,  . . . ,  rik } .  (3.9) 

The  mode  is  typically  applied  to  any  type  of  variable  for  which  the  number  of  different 
values  is  not  too  large.  If  continuous  data  is  summarized  in  groups,  then  the  mode 
can  be  used  as  well. 

Example  3.1. 7  Recall  the  pizza  data  set  described  in  Appendix  A.4.  The  pizza  deliv¬ 
ery  service  has  three  branches,  in  the  East,  West,  and  Centre,  respectively.  Suppose 
we  want  to  know  which  branch  delivers  the  most  pizzas.  We  find  that  most  of  the  de¬ 
liveries  have  been  made  in  the  West,  see  Fig.  3.4a;  therefore  the  mode  is  xm  =  West. 
Similarly,  suppose  we  also  want  to  find  the  mode  for  the  categorized  pizza  deliv¬ 
ery  time:  if  we  group  the  delivery  time  in  intervals  of  5  min,  then  we  see  that  the 
most  frequent  delivery  time  is  the  interval  “30— 35”  min,  see  Fig.  3.4b.  The  mode  is 
therefore  xm  =  [30,  35). 
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o 


Branch 


(a)  Bar  chart  for  branch 
Fig.  3.4  Results  from  the  pizza  data  set 


CD 


10  20  30  40  50  60 

Delivery  Time  (in  min) 

(b)  Histogram  for  grouped  delivery  time 


3.1 .5  Geometric  Mean 


Consider  n  observations  x\,  X2,  . . . ,  xn  which  are  all  positive  and  collected  on  a 
quantitative  variable.  The  geometric  mean  xq  of  this  data  is  defined  as 


xg 


(3.10) 


The  geometric  mean  plays  an  important  role  in  fields  where  we  are  interested  in 
products  of  observations,  such  as  when  we  look  at  percentage  changes  in  quantities. 
We  illustrate  its  interpretation  and  use  by  looking  at  the  average  growth  of  a  quantity 
in  the  sense  that  we  allow  a  starting  value,  such  as  a  certain  amount  of  money  or  a 
particular  population,  to  change  over  time.  Suppose  we  have  a  starting  value  at  some 
baseline  time  point  0  (zero),  which  may  be  denoted  as  Bo.  At  time  t ,  this  value  may 
have  changed  and  we  therefore  denote  it  as  Bt,  t  =  1,  2,  . . . ,  T.  The  ratio  of  Bt  and 
Bt—u 

Bt 


is  called  the  t th  growth  factor.  The  growth  rate  rt  is  defined  as 


rt  =  ((xt  -  1)  •  100)  % 


and  gives  us  an  idea  about  the  growth  or  decline  of  our  value  at  time  t.  We  can 
summarize  these  concepts  in  the  following  table: 
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Time  Inventory 

Growth  factor 

Growth  rate 

t  B, 

xt 

rt 

o  Bo 

— 

— 

1  Bi 

x\  —  B\/Bo 

^2 

k— 

K— ^ 

o 

o 

'S 

2  B2 

X2  =  B2/B\ 

((*2  -  1)  *  100)  % 

T  Bt 

xt  —  Bt  /  Bt — \ 

^2 

o 

o 

'S 

We  can  calculate  Bt  (t  =  1,  2, . . . ,  T)  by  using  the  growth  factors: 

Bf  =  Bq  ’  X\  •  X2  '  ...  •  Xf . 

The  average  growth  factor  from  Z?o  to  Bj  is  the  geometric  mean  or  geometric  average 
of  the  growth  factors: 

Xq  = 


Therefore,  Bt  at  time  t  can  be  calculated  as  Bt  —  B{)  Xq  . 

Example  3.1.8  Suppose  someone  wants  to  deposit  money,  say  €1000,  in  a  bank. 
The  bank  advisor  proposes  a  5-year  savings  plan  with  the  following  plan  for  interest 
rates:  1  %  in  the  first  year,  1.5  %  in  the  second  year,  2.5  %  in  the  third  year,  and  3  % 
in  the  last  2  years.  Now  he  would  like  to  calculate  the  average  growth  factor  and 
average  growth  rate  for  the  invested  money.  The  concept  of  the  geometric  mean  can 
be  used  as  follows: 


Year 

Euro 

Growth  factor  Growth  rate  (%) 

0 

1000 

— 

— 

1 

1010 

1.01 

1.0 

2 

1025.15 

1.015 

1.5 

3 

1050.78 

1.025 

2.5 

4 

1082.30 

1.03 

3.0 

5 

1114.77 

1.03 

3.0 

The  geometric  mean  is  calculated  as 

*G  =  (1.01  •  1.015  •  1.025  •  1.03  •  1.03)5  =  1.021968 

which  means  that  he  will  have  on  average  about  2.2  %  growth  per  year.  The  savings 
after  5  years  can  be  calculated  as 

€  1000  •  1.0219685  =  €  1114.77. 

It  is  easy  to  compare  two  different  saving  plans  with  different  growth  strategies  using 
the  geometric  mean. 


Tjx\  •  X2  *  .  .  .  *  Xt 
Tj  Bo  •  X\  •  X2  •  .  .  .  •  XT 

Bo 


Bj 


(3.11) 
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3.1 .6  Harmonic  Mean 


The  harmonic  mean  is  typically  used  whenever  different  Xi  contribute  to  the  mean 
with  a  different  weight  Wi ,  i.e.  when  we  implicitly  assume  that  the  weight  of  each 
Xi  is  not  one.  It  can  be  calculated  as 

-  _  w\  +  W2  H - \~  Wk  _  X/  =  1  wi  n 

Xh  —  W2  , _ I  m  ~ 

XI  —  X2  Xk  2^i  =  1  Xi 

For  example,  when  calculating  the  average  speed,  each  weight  relates  to  the  rela¬ 
tive  distance  travelled,  nt /n,  with  speed  v/.  Using  Wi  =  rii /n  and  X/  wi  —  2/  ni  / 
n  =  1,  the  harmonic  mean  can  be  written  as 


xr 


1 


Zk  Wi 

i= 1 


(3.13) 


Example  3.1.9  Suppose  an  investor  bought  shares  worth  €1000  for  two  consecutive 
months.  The  price  for  a  share  was  €50  in  the  first  month  and  €200  in  the  second 
month.  What  is  the  average  purchase  price?  The  number  of  shares  purchased  in 
the  first  month  is  1000/50  =  20.  The  number  of  shares  purchased  in  the  second 
month  is  1000/200  =  5.  The  total  number  of  shares  purchased  is  thus  20  +  5  =  25, 
and  the  total  investment  is  €2000.  It  is  evident  that  the  average  purchase  price  is 
2000/25  =  €80.  This  is  in  fact  the  harmonic  mean  calculated  as 

1 

XH  0.5  |  0.5  —  80 
50  200 

because  the  weight  of  each  purchase  is  rii/n  =  1000/2000  =  0.5.  If  the  investment 
was  €1200  in  the  first  month  and  €800  in  the  second  month,  then  we  could  use  the 
harmonic  mean  with  weights  1200/2000  =  0.6  and  800/2000  =  0.4,  respectively, 
to  obtain  the  results. 


3.2  Measures  of  Dispersion 

Measures  of  central  tendency,  as  introduced  earlier,  give  us  an  idea  about  the  loca¬ 
tion  where  most  of  the  data  is  concentrated.  However,  two  different  data  sets  may 
have  the  same  value  for  the  measure  of  central  tendency,  say  the  same  arithmetic 
means,  but  they  may  have  different  concentrations  around  the  mean.  In  this  case,  the 
location  measures  may  not  be  adequate  enough  to  describe  the  distribution  of  the 
data.  The  concentration  or  dispersion  of  observations  around  any  particular  value  is 
another  property  which  characterizes  the  data  and  its  distribution.  We  now  introduce 
statistical  methods  which  describe  the  variability  or  dispersion  of  data. 


3.2  Measures  of  Dispersion 
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Example  3.2.1  Suppose  three  students  Christine,  Andreas,  and  Sandro  arrive  at  dif¬ 
ferent  times  in  the  class  to  attend  their  lectures.  Let  us  look  at  their  arrival  time  in 
the  class  after  or  before  the  starting  time  of  lecture,  i.e.  let  us  look  how  early  or  late 
they  were  (in  minutes). 


Week 

1 

2 

3 

4 

5 

6 

7 

8 

9 

10 

Christine 

0 

0 

0 

0 

0 

0 

0 

0 

0 

0 

Andreas 

-10  +10 

-10  +10 

-10  +10 

-10  +10 

-10  +10 

Sandro 

3 

5 

6 

2 

4 

6 

8 

4 

5 

7 

We  see  that  Christine  always  arrives  on  time  (time  difference  of  zero).  Andreas  arrives 
sometimes  10  min  early  and  sometimes  10  min  late.  However,  the  arithmetic  mean  of 
both  students  is  the  same — on  average,  they  both  arrive  on  time!  This  interpretation 
is  obviously  not  meaningful.  The  difference  between  both  students  is  the  variability 
in  arrival  times  that  cannot  be  measured  with  the  mean  or  median.  For  this  reason,  we 
need  to  introduce  measures  of  dispersion  (variability).  With  the  knowledge  of  both 
location  and  dispersion,  we  can  give  a  much  more  nuanced  comparison  between  the 
different  arrival  times.  For  example,  consider  the  third  student  Sandro.  He  is  always 
late;  sometimes  more,  sometimes  less.  However,  while  on  average  he  comes  late,  his 
behaviour  is  more  predictable  than  that  of  Andreas.  Both  location  and  dispersion  are 
needed  to  give  a  fair  comparison. 

Example  3.2.2  Consider  another  example  in  which  a  supplier  for  the  car  industry 
needs  to  deliver  10  car  doors  with  an  exact  width  of  1 .00  m.  He  supplies  5  doors  with 
a  width  of  1.05  m  and  the  remaining  5  doors  with  a  width  of  0.95  m.  The  arithmetic 
mean  of  all  the  10  doors  is  1.00  m.  Based  on  the  arithmetic  mean,  one  may  conclude 
that  all  the  doors  are  good  but  the  fact  is  that  none  of  the  doors  are  usable  as  they  will 
not  fit  into  the  car.  This  knowledge  can  be  summarized  by  a  measure  of  dispersion. 

The  above  examples  highlight  that  the  distribution  of  a  variable  needs  to  be  char¬ 
acterized  by  a  measure  of  dispersion  in  addition  to  a  measure  of  location  (central 
tendency).  Now  we  introduce  various  measures  of  dispersion. 


3.2.1  Range  and  Interquartile  Range 


Consider  a  variable  X  with  n  observations  x\ ,  X2,  •  • . ,  xn .  Order  these  n  observations 
as  Jt(i)  <  X(2)  <  •  •  •  <  X(n).  The  range  is  a  measure  of  dispersion  defined  as  the 
difference  between  the  maximum  and  minimum  value  of  the  data  as 


^  —  X(n)  -^(1)  • 


(3.14) 


The  interquartile  range  is  defined  as  the  difference  between  the  75th  and  25th 
quartiles  as 


dQ  =  Xo.75  —  *0.25- 


(3.15) 


It  covers  the  centre  of  the  distribution  and  contains  50  %  of  the  observations. 
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Remark  3.2.1  Note  that  the  interquartile  range  is  defined  as  the  interval  [xo.25 ;  *0.75] 
in  some  literature.  However,  in  line  with  most  of  the  statistical  literature,  we  define 
the  interquartile  range  to  be  a  measure  of  dispersion,  i.e.  the  difference  between  io.75 
and  vq.25- 


Example  3.2.3  Recall  Examples  3. 0.1-3. 1.5  where  we  looked  at  the  temperature  in 
Bangkok  during  December.  The  ordered  values  x#),  i  =  1, . . . ,  31,  are  as  follows: 


°c 

21 

22 

22 

23 

24 

24 

25 

25 

25 

25 

25 

25 

26 

26 

26 

26 

(i) 

1 

2 

3 

4 

5 

6 

7 

8 

9 

10 

11 

12 

13 

14 

15 

16 

°c 

27 

27 

27 

28 

28 

28 

29 

29 

29 

29 

29 

30 

30 

30 

31 

(i) 

17 

18 

19 

20 

21 

22 

23 

24 

25 

26 

27 

28 

29 

30 

31 

We  obtained  the  quantiles  in  Example  3.1.5  as  vo.25  =  25  and  vo.75  =  29.  The  in¬ 
terquartile  range  is  therefore  cIq  =29  —  25  =4,  which  means  that  50  %  of  the  data 
is  centred  between  25  and  29  °C.  The  range  is  R  =  31  —  21  =  10  °C,  meaning  that 
the  temperature  is  varying  at  most  by  10  °C.  In  R ,  there  are  several  ways  to  ob¬ 
tain  quartiles,  minimum  and  maximum  values,  e.g.  by  using  min,  max,  quantiles, 
range,  among  others.  All  numbers  can  be  easily  obtained  by  the  summary  command 
which  we  recommend  using. 

summary (weather ) 


3.2.2  Absolute  Deviation,  Variance,  and  Standard  Deviation 

Another  measure  of  dispersion  is  the  variance.  The  variance  is  one  of  the  most 
important  measures  in  statistics  and  is  needed  throughout  this  book.  We  use  the  idea  of 
“absolute  deviation”  to  give  some  more  background  and  motivation  for  understanding 
the  variance  as  a  measure  of  dispersion,  followed  by  some  examples. 

Consider  the  deviations  of  n  observations  around  a  certain  value  “A”  and  combine 
them  together,  for  instance,  via  the  arithmetic  mean  of  all  the  deviations: 

1 

D  =  - 
n 

This  measure  has  the  drawback  that  the  deviations  (A;  —  A),  i  =  1,2 , ...  ,n,  can  be 
either  positive  or  negative  and,  consequently,  their  sum  can  potentially  be  very  small 
or  even  zero.  Using  D  as  a  measure  of  variability  is  therefore  not  a  good  idea  since 
D  may  be  small  even  for  a  large  variability  in  the  data. 


-  A) 


(3.16) 


3.2  Measures  of  Dispersion 


51 


Using  absolute  values  of  the  deviations  solves  this  problem,  and  we  introduce  the 
following  measure  of  dispersion: 


(3.17) 


It  can  be  shown  that  the  absolute  deviation  attains  its  minimum  when  A  corresponds 
to  the  median  of  the  data: 


1  n 

D(xo.5)  =  -  Y  k/  -*o.5|. 

n  “ 

i= l 


(3.18) 


We  call  D(vo.5)  the  absolute  median  deviation.  When  A  =  x,  we  speak  of  the 

absolute  mean  deviation  given  by 


(3.19) 


Another  solution  to  avoid  the  positive  and  negative  signs  of  deviation  in  (3.16)  is  to 
consider  the  squares  of  deviations  Xi  —  A,  rather  than  using  the  absolute  value.  This 
provides  another  measure  of  dispersion  as 


s2(A) 


n 

-  Y(x,  -  A)2 

n  z — ' 

i=\ 


(3.20) 


which  is  known  as  the  mean  squared  error  (MSE)  with  respect  to  A.  The  MSE 
is  another  important  measure  in  statistics,  see  Chap.  9,  Eq.  (9.4),  for  details.  It  can 
be  shown  that  s2(A )  attains  its  minimum  value  when  A  =  x.  This  is  the  (sample) 

variance 

1  n 

~s2  =  -  V  (xt  -  x)2.  (3.21) 

n  f-' 

/  —  I 

After  expanding  s 2,  we  can  write  (3.21)  as 


s 


2 


(3.22) 


The  positive  square  root  of  the  variance  is  called  the  (sample)  standard  deviation, 
defined  as 


s 


N 


i  n 

-x)2, 

n  i=\ 


(3.23) 


The  standard  deviation  has  the  same  unit  of  measurement  as  the  data  whereas  the 
unit  of  the  variance  is  the  square  of  the  units  of  the  observations.  For  example,  if  X  is 
weight,  measured  in  kg,  then  v  and  s  are  also  measured  in  kg,  while  s 2  is  measured 
in  kg2  (which  may  be  more  difficult  to  interpret).  The  variance  is  a  measure  which 
we  use  in  other  chapters  to  obtain  measures  of  association  between  variables  and  to 
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draw  conclusions  from  a  sample  about  a  population  of  interest;  however,  the  standard 
deviation  is  typically  preferred  for  a  descriptive  summary  of  the  dispersion  of  data. 

The  standard  deviation  measures  how  much  the  observations  vary  or  how  they 
are  dispersed  around  the  arithmetic  mean.  A  low  value  of  the  standard  deviation 
indicates  that  the  values  are  highly  concentrated  around  the  mean.  A  high  value  of 
the  standard  deviation  indicates  lower  concentration  of  the  observations  around  the 
mean,  and  some  of  the  observed  values  may  even  be  far  away  from  the  mean.  If  there 
are  extreme  values  or  outliers  in  the  data,  then  the  arithmetic  mean  is  more  sensitive 
to  outliers  than  the  median.  In  such  a  case,  the  absolute  median  deviation  (3.18)  may 
be  preferred  over  the  standard  deviation. 


Example  3.2.4  Consider  again  Example  3.2.1  where  we  evaluated  the  arrival  times 
of  Christine,  Andreas,  and  Sandro  in  their  lecture.  Using  the  arithmetic  mean,  we 
concluded  that  both  Andreas  and  Christine  arrive  on  time,  whereas  Sandro  is  always 
late;  however,  we  saw  that  the  variation  of  arrival  times  differs  substantially  among 
the  three  students.  To  describe  and  quantify  this  variability  formally,  we  calculate 
the  variance  and  absolute  median  deviation: 

1  10  1 

Sc  =  io  !>'  -  *)2  =  JQ  ((0  -  0)2  +  •  •  •  +  (0  -  0)2)  =  0 

i  =  1 

i  10  i 

s\  =  ^  YSXi  -  X)1  =  I^((_1°  -  0)2  H - h  (10  -  0)2)  »  111.1 

i  =  1 

i  10  i 

S2S  =  —  -  *)2  =  -  5)2  H - h  (7  -  5)2)  »  3.3 


D(xo.5,c )  —  ~  y.  I Xi  —  X0.5I  —  |0  —  0|  +  •  •  •  +  |0  —  0|  —  0 

i= 1 

1  n 

D(xo.5 ,a)  =  —  |*i  —  *0.5 1  =  |  —  10  —  0|  +  •  •  •  +  |10  —  0|  =  10 

1  =  1 

1  n 

D(xo.5,s)  =  —  \xi  ~  *0.5 1  —  |3  —  5|  +  •  •  •  +  |7  —  5|  =  1.4. 

1  =  1 

We  observe  that  the  variation/dispersion/variability  is  the  lowest  for  Christine  and 
highest  for  Andreas.  Both  median  absolute  deviation  and  variance  allow  a  comparison 
between  the  two  students.  If  we  take  the  square  root  of  the  variance,  we  obtain  the 
standard  deviation.  For  example,  Ss  =  V33  ~  1.8,  which  means  that  the  average 
difference  of  the  observations  from  the  arithmetic  mean  is  1.8. 

In  R ,  we  can  use  the  var  command  to  calculate  the  variance.  However,  note 
that  R  uses  l/(n  —  1)  instead  of  l/n  in  calculating  the  variance.  The  idea  behind 
the  multiplication  by  \/(n  —  1)  in  place  of  l/n  is  discussed  in  Chap. 9,  see  also 
Theorem  9.2.1. 
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Variance  for  Grouped  Data.  The  variance  for  grouped  data  can  be  calculated  using 


(3.24) 


where  a  j  is  the  middle  value  of  the  j  th  interval.  However,  when  the  data  is  artificially 
grouped  and  the  knowledge  about  the  original  ungrouped  data  is  available,  we  can 
also  use  the  arithmetic  mean  of  the  jth  class: 

i  k 

sj  =  -  /  ,  n  i (x j  -x)2.  (3.25) 

7  =  1 

The  two  expressions  (3.24)  and  (3.25)  represent  the  variance  between  the  different 
classes,  i.e.  they  describe  the  variability  of  the  class  specific  means  xj ,  weighted  by 
the  size  of  each  class  nj ,  around  the  overall  mean  x.  It  is  evident  that  the  variance 
within  each  class  is  not  taken  into  account  in  these  formulae.  The  variability  of 
measurements  in  each  class,  i.e.  the  variability  of  Vx;  e  Kj ,  is  another  important 
component  to  determine  the  overall  variance  in  the  data.  It  is  therefore  not  surprising 
that  using  only  the  between  variance  will  underestimate  the  total  variance  and 
therefore 

Sfr  <  s2 .  (3.26) 


If  the  data  within  each  class  is  known,  we  can  use  the  Theorem  of  Variance 
Decomposition  (see  p.  136  for  the  theoretical  background)  to  determine  the  variance. 
This  allows  us  to  represent  the  total  variance  as  the  sum  of  the  variance  between 
the  different  classes  and  the  variance  within  the  different  classes  as 


2  =  l  HnJ^j  ~*)2+l  ■ 

7=1  7=1 


(3.27) 


between 


within 


In  (3.27),  s2  is  the  variance  of  the  jth  class: 


f  =  j:  Z  (x'  -  *;)2' 


(3.28) 


%i  €  Kj 


The  proof  of  (3.27)  is  given  in  Appendix  C.l,  p.  423. 


Example  3.2.5  Recall  the  weather  data  used  in  Examples  3.0. 1-3. 2. 3  and  the  grouped 
data  specified  as  follows: 


Class  intervals ; 

<20 

(20-25] 

(25,  30] 

(30,  35] 

>35 

nJ 

n\  =  0 

(N 

H 

II 

CN 

S 

^3  =  18 

Ft  4  =  1 

n5  =0 

*j 

— 

23.83 

28 

31 

— 

— 

1.972 

2 

0 

— 
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We  know  that  v  =  26.48  and  n  =  31.  The  first  step  is  to  calculate  the  mean  and 
variances  in  each  class  using  (3.28).  We  then  obtain  x;  and  s2  as  listed  above.  The 
within  and  between  variances  are  as  follows: 


1 

n 


k 


7  =  1 


1 

n 


1 

—  (12  •  1.972  +  18*2-1-1*0)  ~  1.925 
31 

-^(12  •  [23.83  -  26. 48]2  +  18  •  [28  -  26.48]2 
+  1  •  [31  -26.48]2)  ~  4.71. 


The  total  variance  is  therefore  s2  ~  6.64.  Estimating  the  variance  using  all  31  ob¬ 
servations  would  yield  the  same  results.  However,  it  becomes  clear  that  without 
knowledge  about  the  variance  within  each  class,  we  cannot  reliably  estimate  s2.  In 
the  above  example,  the  variance  between  the  classes  is  3  times  lower  than  the  total 
variance  which  is  a  serious  underestimation. 


Linear  Transformations.  Let  us  consider  a  linear  transformation  yt  =  a  +  bxi 
(b  /  0)  of  the  original  data  jq,  (i  =  1,2,...,  /i).  We  get  the  arithmetic  mean  of 
the  transformed  data  as  y  =  a  +  bx  and  for  the  variance: 

?2  =  1  £(w  -  y)2  =  -  j[>  -  i)2 

n  A — '  n  z — ' 

i  =  \  i  =  1 

=  b2s2.  (3.29) 

Example  3.2.6  Let  v; ,  i  =  1,2 , ...  ,n,  denote  measurements  on  time.  These  data 
could  have  been  recorded  and  analysed  in  hours,  but  we  may  be  interested  in  a 
summary  in  minutes.  We  can  make  a  linear  transformation  y;  =  60  v; .  Then,  y  =  60x 
and  s2  =  602£2.  If  the  mean  and  variance  of  the  Xi  ’s  have  already  been  obtained,  then 
the  mean  and  variance  of  the  y*  ’s  can  be  obtained  directly  using  these  transformations. 


Standardization.  A  variable  is  called  standardized  if  its  mean  is  zero  and  its  variance 
is  1.  Standardization  can  be  achieved  by  using  the  following  transformation: 


Xi  —  x  x  1 

yi  =  — - —  =  —  - — b  —Xi  =  a  +  bxi . 

$x  $x 


(3.30) 


It  follows  that  y  =  ^Ji= l  (xi  ~  *)/sx  —  0  and  =  X?=i(- xi  ~  ^)2/^x  =  1*  There 
are  many  statistical  methods  which  require  standardization,  see,  for  example, 
Sect.  10.3.1  for  details  in  the  context  of  statistical  tests. 


Example  3.2. 7  Let  X  be  a  variable  which  measures  air  pollution  by  using  the  concen¬ 
tration  of  atmospheric  particulate  matter  (in  |jig/m3).  Suppose  we  have  the  following 
10  measurements: 
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30  25  12  45  50  52  38  39  45  33. 

We  calculate  x  =  36.9 ,s%  =  136.09,  and  sx  =  11.67.  To  get  a  standardized  vari¬ 
able  Y,  we  transform  all  the  observations  Xi  ’s  as 


Xi  -  x 
yt  =  — — 


l 

+  —Xi 
Sx 


36.9  1 

-  T  - Xi 

11.67  11.67 


—3.16  +  0.086v;. 


Now  y\  ~  —3.16  +  0.086  •  30  =  —0.58,  y 2  =  —3.16  +  0.086  •  25  =  —1.01,  . . ., 
are  the  standardized  observations.  The  scale  command  in  R  allows  standard¬ 
ization,  and  we  can  obtain  the  standardized  observations  corresponding  to  the 
10  measurements  as 

air  <-  c(30, 25, 12,45,50,52,38,39,45,33) 

IHk 

scale (air) 


Please  note  that  the  scale  command  uses  \/(n  —  1)  for  calculating  the  variance, 
as  already  outlined  above.  Thus,  the  results  provided  by  scale  are  not  identical  to 
those  using  (3.30). 


3.2.3  Coefficient  of  Variation 


Consider  a  situation  where  two  different  variables  have  arithmetic  means  x\  and  X2 
with  standard  deviations  £1  and  S2,  respectively.  Suppose  we  want  to  compare  the 
variability  of  hotel  prices  in  Munich  (measured  in  euros)  and  London  (measured 
in  British  pounds).  How  can  we  provide  a  fair  comparison?  Since  the  prices  are 
measured  in  different  units,  and  therefore  likely  have  arithmetic  means  which  differ 
substantially,  it  does  not  make  much  sense  to  compare  the  standard  deviations  di¬ 
rectly.  The  coefficient  of  variation  v  is  a  measure  of  dispersion  which  uses  both  the 
standard  deviation  and  mean  and  thus  allows  a  fair  comparison.  It  is  properly  defined 
only  when  all  the  values  of  a  variable  are  measured  on  a  ratio  scale  and  are  positive 
such  that  x  >0  holds.  It  is  defined  as 


v  =  — . 
x 


(3.31) 


The  coefficient  of  variation  is  a  unit-free  measure  of  dispersion.  It  is  often  used  when 
the  measurements  of  two  variables  are  different  but  can  be  put  into  relation  by  using  a 
linear  transformation  y;  =  bxi .  It  is  possible  to  show  that  if  all  values  Xi  of  a  variable 
X  are  transformed  into  a  variable  Y  with  values  yt  =  b  •  Xi,  b  >  0,  then  v  does  not 
change. 
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Example  3.2.8  If  we  want  to  compare  the  variability  of  hotel  prices  in  two  selected 
cities  in  Germany  and  England,  we  could  calculate  the  mean  prices,  together  with 
their  standard  deviation.  Suppose  a  sample  of  prices  of  say  100  hotels  in  two  selected 
cities  in  Germany  and  England  is  available  and  suppose  we  obtain  the  mean  and 
standard  deviations  of  the  two  cities  as  x\  =  €130,  X2  =  £230,  s\  =  €99,  and  S2  = 
£212.  Then,  v\  =  99/130  ~  0.72  and  V2  =  212/230  =  0.92.  This  indicates  higher 
variability  in  hotel  prices  in  England.  However,  if  the  data  distribution  is  skewed 
or  bimodal,  then  it  may  be  wise  not  to  choose  the  arithmetic  mean  as  a  measure  of 
central  tendency  and  likewise  the  coefficient  of  variation. 


3.3  Box  Plots 


So  far  we  have  described  various  measures  of  central  tendency  and  dispersion.  It  can 
be  tedious  to  list  those  measures  in  summary  tables.  A  simple  and  powerful  graph 
is  the  box  plot  which  summarizes  the  distribution  of  a  continuous  (or  sometimes  an 
ordinal)  variable  by  using  its  median,  quartiles,  minimum,  maximum,  and  extreme 
values. 

Figure  3.5a  shows  a  typical  box  plot.  The  vertical  length  of  the  box  is  the  in¬ 
terquartile  range  cIq  =  vo.75  —  vo.25,  which  shows  the  region  that  contains  50%  of 
the  data.  The  bottom  end  of  the  box  refers  to  the  first  quartile,  and  the  top  end  of  the 
box  refers  to  the  third  quartile.  The  thick  line  in  the  box  is  the  median.  It  becomes 
immediately  clear  that  the  box  indicates  the  symmetry  of  the  data:  if  the  median  is 
in  the  middle  of  the  box,  the  data  should  be  symmetric,  otherwise  it  is  skewed.  The 
whiskers  at  the  end  of  the  plot  mark  the  minimum  and  maximum  values  of  the  data. 
Looking  at  the  box  plot  as  a  whole  tells  us  about  the  data  distribution  and  the  range 
and  variability  of  observations.  Sometimes,  it  may  be  advisable  to  understand  which 
values  are  extreme  in  the  sense  that  they  are  “far  away”  from  the  centre  of  the  distri¬ 
bution.  In  many  software  packages,  including  R ,  values  are  defined  to  be  extreme  if 
they  are  greater  than  1.5  box  lengths  away  from  the  first  or  third  quartile.  Sometimes, 
they  are  called  outliers.  Outliers  and  extreme  values  are  defined  differently  in  some 
software  packages  and  books. 
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extreme  values 
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(a)  Box  plot  without  extreme  values 


(b)  Box  plot  with  extreme  values 
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The  boxplot  command  in  R  draws  a  box  plot.  The  range  option  controls  whether 
extreme  values  should  be  plotted,  and  if  yes,  how  one  wants  to  define  such  values. 

boxplot (variable ,  range=1.5) 


Example  3.3.1  Recall  Examples  3.0. 1-3. 2. 5  where  we  looked  at  the  temperature  in 
Bangkok  during  December.  We  have  already  calculated  the  median  (26° C)  and  the 
quartiles  (25, 29° C).  The  minimum  and  maximum  values  are  21°C  and  3 1°C.  The  box 
plot  for  this  data  is  shown  in  Fig.  3.5a.  One  can  see  that  the  temperature  distribution 
is  slightly  skewed  with  more  variability  for  lower  temperatures.  The  interquartile 
range  is  4,  and  therefore,  any  value  >29  +  4  x  1.5  =  35  or  <25  —  4  x  1.5  =  19 
would  be  an  extreme  value.  However,  there  are  no  extreme  values  in  the  data. 

Example  3.3.2  Consider  again  the  pizza  data  described  in  Appendix  A. 4.  We  use  R 
to  plot  the  box  plot  for  the  delivery  time  via  boxplot  (time)  (Fig.  3.5b).  We  see 
a  symmetric  distribution  with  a  median  delivery  time  of  about  35  min.  Most  of  the 
deliveries  took  between  30  and  40  min.  The  extreme  values  indicate  that  there  were 
some  exceptionally  short  and  long  delivery  times. 
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(a)  Boxplot  for  weather  data 


(b)  Boxplot  for  pizza  data 


3.4  Measures  of  Concentration 

A  completely  different  concept  used  to  describe  a  quantitative  variable  is  the  idea  of 
concentration.  For  a  variable  X,  it  summarizes  the  proportion  of  each  observation 
with  respect  to  the  sum  of  all  observations  2>=i  xi  •  Let  us  look  at  a  simple  example 
to  demonstrate  its  usefulness. 
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Table  3.1  Concentration  of 
farmland:  two  different 
situations 


Farmer  (i) 

Xi  (Area,  in  hectare) 

1 

20 

2 

20 

3 

20 

4 

20 

5 

20 

zti  *1  =  100 

Farmer  (/) 

Xi  (Area,  in  hectare) 

1 

0 

2 

0 

3 

0 

4 

0 

5 

100 

M 

H— 

II 

o 

o 

Example  3.4.1  Consider  a  village  with  5  farms.  Each  farmer  has  a  farm  of  a  certain 
size.  How  can  we  evaluate  the  land  distribution?  Do  all  farmers  have  a  similar  amount 
of  land  or  do  one  or  two  farmers  have  a  big  advantage  because  they  have  considerably 
more  space? 

Table 3.1  shows  two  different  situations:  in  the  table  on  the  left,  we  see  an  equal 
distribution  of  land,  i.e.  each  farmer  owns  20  hectares  of  farmland.  This  means  X  is 
not  concentrated,  rather  it  is  equally  distributed.  A  statistical  function  describing  the 
concentration  could  return  a  value  of  zero  in  such  a  case.  Consider  another  extreme 
where  one  farmer  owns  all  the  farmland  and  the  others  do  not  own  anything,  as  shown 
on  the  right  side  of  Table  3.1.  This  is  an  extreme  concentration  of  land:  one  person 
owns  everything  and  thus,  we  say  the  concentration  is  high.  A  statistical  function 
describing  the  concentration  could  return  a  value  of  one  in  such  a  case. 


3.4.1  Lorenz  Curve 


The  Lorenz  curve  is  a  popular  method  to  display  concentrations  graphically.  Con¬ 
sider  n  observations  x\ ,  X2, . . . ,  xn  of  a  variable  X.  Assume  that  all  the  observations 
are  positive.  The  sum  of  all  the  observations  is  xi  —  nx  if  the  data  is  un¬ 
grouped.  First,  we  need  to  order  the  data:  0  <  V(i)  <  x@)  <  •  •  •  <  X(n).  To  plot  the 
Lorenz  curve,  we  need 

Uj  =  — ,  i  =  0,  . . . ,  n,  (3.32) 

n 


and 


Vi 


Zj= 1  Hi) 
Z"=i  *0)’ 


i  =  1,  ...  ,n;  no  :=  0, 


(3.33) 
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Fig.  3.5  Lorenz  curves  for  no  concentration  (left)  and  some  concentration  (right)* 

where  X}=i  x(j)  *s  the  cumulative  total  of  observations  up  to  the  i th  observation. 
The  idea  is  that  Vi  describe  the  contribution  of  all  values  <  i  in  comparison  with 
the  sum  of  all  values.  Plotting  Ui  against  Vi  for  all  i  shows  how  much  the  sum  of  all 
Xi,  for  all  observations  <  i,  contributes  to  the  total  sum.  In  other  words,  the  point 
(ui,  Vi)  says  that  w*  •  100%  of  observations  contain  Vi  •  100%  of  the  sum  of  all  xt 
less  than  or  equal  to  i .  Obviously,  if  all  xt  are  identical,  the  Lorenz  curve  will  be  a 
straight  diagonal  line,  also  known  as  the  identity  line  or  line  of  equality.  If  the  xL 
are  of  different  sizes,  then  the  Lorenz  curve  falls  below  the  line  of  equality.  This  is 
illustrated  in  the  following  example. 

Example  3.4. 2  Recall  Example  3 .4. 1  where  we  looked  at  the  distribution  of  farmland 
among  5  farmers.  On  the  upper  panel  of  Table  3.1,  we  observed  an  equal  distribution 
of  land  among  the  farmers:  x\  =  20,  X2  =  20,  x^  =  20,  X4  =  20,  and  V5  =20.  We 
obtain  u\  =  1/5,  U2  =  2/5, . . . ,  u$  =  1  and  v\  =  20/100,  V2  =  40/100,  . . . ,  V5  = 
1.  This  yields  a  Lorenz  curve  as  displayed  on  the  left  side  of  Fig.  3.5:  there  is  no 
concentration.  We  can  interpret  each  point.  For  example,  ( U2 ,  V2)  =  (0.4,  0.4)  means 
that  40  %  of  farmers  own  40  %  of  the  land. 

The  lower  panel  of  Table  3. 1  describes  the  situation  with  strong  concentration.  For 
this  table,  we  obtain  u  1  =  1/5,  U2  =  2/5,  . . . ,  U5  =  landiq  =  0,  V2  =  0,  . . . ,  V5  = 
1 .  Therefore,  for  example,  80  %  of  farmers  own  0  %  of  the  land  which  shows  strong 
inequality.  Most  often  we  do  not  have  such  extreme  situations.  In  this  case,  the  Lorenz 
curve  is  bent  towards  the  lower  right  corner  of  the  plot,  see  the  right  side  of  Fig.  3.5. 

We  can  plot  the  Lorenz  curve  in  R  using  the  Lc  command  in  the  library  ineq. 
The  Lorenz  curve  for  the  left  table  of  Example  3.4.1  is  plotted  in  R  as  follows: 

library (ineq) 
x  <-  c (20 ,20,20,20,20) 
plot (Lc (x) ) 


60 


3  Measures  of  Central  Tendency  and  Dispersion 


Fig.  3.6  Lorenz  curve  and 
the  Gini  coefficient* 


We  can  use  the  same  approach  as  above  to  obtain  the  Lorenz  curve  when  we  have 
grouped  data.  We  simply  describe  the  contributions  for  each  class  rather  than  for 
each  observation  and  approximate  the  values  in  each  class  by  using  its  mid-point. 
More  formally  we  can  write: 


Hi  =  fj,  i  =  1,  2,  . . . ,  k\  uo  :=  0 

7  =  1 


(3.34) 


and 


Vi  = 


Z/= l  fjaj  Z/= l  njaj 


1  ~ 


2  7  =  1  fjaj 


nx 


,  i  =  1,2,  vq:=0.  (3.35) 


3.4.2  Gini  Coefficient 

We  have  seen  in  Sect.  3.4. 1  that  the  Lorenz  curve  corresponds  to  the  identity  line,  that 
is  the  diagonal  line  of  equality,  for  no  concentration.  When  there  is  some  concentra¬ 
tion,  then  the  curve  deviates  from  this  line.  The  amount  of  deviation  depends  on  the 
strength  of  concentration.  Suppose  we  want  to  design  a  measure  of  concentration 
which  is  0  for  no  concentration  and  1  for  perfect  (i.e.  extreme)  concentration.  We  can 
simply  measure  the  area  between  the  Lorenz  curve  and  the  identity  line  and  multiply 
it  by  2.  For  no  concentration,  the  area  will  be  zero  and  hence  the  measure  will  be 
zero.  If  there  is  perfect  concentration,  then  the  curve  will  coincide  with  the  axes,  the 
area  will  be  close  to  0.5,  and  twice  the  area  will  be  close  to  one.  The  measure  based 
on  such  an  approach  is  called  the  Gini  coefficient: 

G  =  2  •  F.  (3.36) 

Note  that  F  is  the  area  between  the  curve  and  the  bisection  or  diagonal  line. 
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The  Gini  coefficient  can  be  estimated  by  adding  up  the  areas  of  the  trapeziums  G 
as  displayed  in  Fig.  3.6: 

n 

F  =  £  Fi  -  0.5, 

i=l 


where 


Mj-l  +  Ui 
2 


It  can  be  shown  that  this  corresponds  to 


Vi- 1). 


1  " 

G  =  1 - 5>-i  +  Ui), 

n  " 

/  — 1 


(3.37) 


but  the  proof  is  omitted.  The  same  formula  can  be  used  for  grouped  data  except  that 
v  is  used  instead  of  v.  Since 

n  —  1 


0  <  G  < 


n 


(3.38) 


one  may  prefer  to  use  the  standardized  Gini  coefficient 

n 


G+  = 


n  —  1 


G, 


(3.39) 


which  takes  a  maximum  value  of  1 


Example  3.4.3  We  return  to  our  farmland  example.  Suppose  we  have  7  farmers  with 
farms  of  different  sizes: 


Farmer  1  2  3  4  5  6  7 

Farmland  size  Xi  20  14  59  9  36  23  3 


Using  the  ordered  values,  we  can  calculate  Ui  and  ty-  using  (3.32)  and  (3.33): 


i  x(i)  Ui  Vi 


1 

3 

l 

7 

=  0.1429 

3 

164 

=  0.0183 

2 

9 

2 

7 

=  0.2857 

12 

164 

=  0.0732 

3 

14 

3 

7 

=  0.4286 

26 

164 

=  0.1585 

4 

20 

4 

7 

=  0.5714 

46 

164 

=  0.2805 

5 

23 

5 

7 

=  0.7143 

69 

164 

=  0.4207 

6 

36 

6 

7 

=  0.8571 

105 

164 

=  0.6402 

7 

59 

7 

7 

=  1.0000 

164 

164 

=  1.0000 
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Fig.  3.7  Lorenz  curve  for 
Example  3.4.3* 


The  Lorenz  curve  is  displayed  in  Fig.  3.7.  Using  this  information,  it  is  easy  to 
calculate  the  Gini  coefficient: 


1 

G  =  1  -  -(0.0183  +  [0.0183  +  0.0732]  +  [0.0732  +  0.1585]  +  [0.1585  +  0.2805] 

+[0.2805  +  0.4207]  +  [0.4207  +  0.6402]  +  [0.6402  +  1])  =  0.402 

We  know  that  G  =  0.4024  <  %  =  k  To  standardize  the  coefficient,  we  therefore 
have  to  use  (3.39): 

G+  =  —G  =  -  •  0.4024  =  0.4695  . 

6  6 

In  R,  we  can  obtain  the  non- standardized  Gini  Coefficient  using  the  ineq  function 
in  the  library  ineq. 


library (ineq) 

farm  <-  c(20, 14,59,9,36,23,3) 
ineq(f arm) 
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3.5  Key  Points  and  Further  Issues 


Note: 


/  A  summary  on  how  to  descriptively  summarize  data  is  given  in  Appen¬ 
dix  D.l. 

/  The  median  is  preferred  over  the  arithmetic  mean  when  the  data  distri¬ 
bution  is  skewed  or  there  are  extreme  values. 

/  If  data  of  a  continuous  variable  is  grouped,  and  the  original  ungrouped 
data  is  not  known,  additional  assumptions  are  needed  to  calculate  mea¬ 
sures  of  central  tendency  and  dispersion.  However,  in  some  cases,  these 
assumptions  may  not  be  satisfied,  and  the  formulae  provided  may  give 
imprecise  results. 

/  QQ-plots  are  not  only  descriptive  summaries  but  can  also  be  used  to  test 
modelling  assumptions,  see  Chap.  11.9  for  more  details. 

/  The  distribution  of  a  continuous  variable  can  be  easily  summarized  using 
a  box  plot. 


3.6  Exercises 

Exercise  3.1  A  hiking  enthusiast  has  a  new  app  for  his  smartphone  which  summa¬ 
rizes  his  hikes  by  using  a  GPS  device.  Let  us  look  at  the  distance  hiked  (in  km)  and 
maximum  altitude  (in  m)  for  the  last  10  hikes: 


Distance 

12.5  29.9  14.8  18.7  7.6  16.2  16.5  27.4  12.1  17.5 

Altitude 

342  1245  502  555  398  670  796  912  238  466 

(a)  Calculate  the  arithmetic  mean  and  median  for  both  distance  and  altitude. 

(b)  Determine  the  first  and  third  quartiles  for  both  the  distance  and  the  altitude 
variables.  Discuss  the  shape  of  the  distribution  given  the  results  of  (a)  and  (b). 

(c)  Calculate  the  interquartile  range,  absolute  median  deviation,  and  standard  de¬ 
viation  for  both  variables.  What  is  your  conclusion  about  the  variability  of  the 
data? 

(d)  One  metre  corresponds  to  approximately  3.28ft.  What  is  the  average  altitude 
when  measured  in  feet  rather  than  in  metres? 

(e)  Draw  and  interpret  the  box  plot  for  both  distance  and  altitude. 

(f)  Assume  distance  is  measured  as  only  short  (5-15  km),  moderate  (15-20  km),  and 
long  (20-30 km).  Summarize  the  grouped  data  in  a  frequency  table.  Calculate 
the  weighted  arithmetic  mean  under  the  assumption  that  the  raw  data  is  not 
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known.  Determine  the  weighted  median  under  the  assumption  that  the  values 
within  each  class  are  equally  distributed. 

(g)  What  is  the  variance  for  the  grouped  data  when  the  raw  data  is  known,  i.e.  when 
one  has  knowledge  about  the  variance  in  each  class?  How  does  it  compare  with 
the  variance  one  obtains  when  the  raw  data  is  unknown? 

(h)  Use  R  to  reproduce  the  results  of  (a),  (b),  (c),  (e),  and  (f). 

Exercise  3.2  A  gambler  notes  down  his  wins  and  losses  (in  €)  from  playing  10 

games  of  roulette  in  a  casino. 


Round 

123  4  5  6  7  8  9  10 

Won/Lost 

200  600  -200  -200  -200  -100  -100  -400  0 

(a)  Assume  v  =  —  €90  and  s  =  €294.7881.  What  is  the  result  of  round  10? 

(b)  Determine  the  mode  and  the  interquartile  range. 

(c)  A  different  gambler  plays  33  rounds  of  roulette.  His  results  are  x  =  €12  and 
s  =  €1000.  Is  it  meaningful  to  compare  the  variability  of  results  of  the  two 
players  by  using  the  coefficient  of  variation?  If  yes,  determine  the  coefficients 
of  variation;  if  no,  why  is  a  comparison  not  possible? 

Exercise  3.3  A  fashion  boutique  has  summarized  its  daily  sales  of  designer  socks  in 
different  groups:  men’s  socks,  women’s  socks,  and  children’s  socks.  Unfortunately, 
the  data  for  men’s  socks  was  lost.  Determine  the  missing  values. 


n  Arithmetic  mean  Standard  deviation 

in  €  in  € 

Women’s  wear 

Men’s  wear 

Children’s  wear 

45  16  V6 

?  ?  ? 

20  7.5  V3 

Total 

100  15  yiV55 

Exercise  3.4  The  number  of  members  of  a  millionaires’  club  were  as  follows: 


Year 

2011  2012  2013  2014  2015  2016 

Members 

23  24  27  25  30  28 

(a)  What  is  the  average  growth  rate  of  the  membership? 

(b)  Based  on  the  results  of  (a),  how  many  members  would  one  expect  in  2018? 
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Fig.  3.8  QQ-plots 

(c)  The  president  of  the  club  is  interested  in  the  number  of  members  in  2025,  the 
year  when  his  presidency  ends.  Would  it  make  sense  to  predict  the  number  of 
members  for  2025? 

In  2015,  the  members  invested  €250  million  on  the  stock  market.  10  members 
contributed  16%  of  the  investment  sum,  8  members  contributed  €60  million,  8 
members  contributed  €70  million,  and  another  4  members  contributed  the  remaining 
amount. 

(d)  Draw  the  Lorenz  curve  for  this  data. 

(e)  Calculate  and  interpret  the  standardized  Gini  coefficient. 

Exercise  3.5  Consider  the  monthly  salaries  Y  (in  Swiss  francs)  of  a  well-reputed 
software  company,  as  well  as  the  length  of  service  (in  months,  X ),  and  gender  (Z). 
Figure  3.8  shows  the  QQ-plots  for  both  Y  and  X  given  Z.  Interpret  both  graphs. 

Exercise  3.6  There  is  no  built-in  function  in  R  to  calculate  the  mode  of  a  variable. 
Program  such  a  function  yourself.  Hint:  type  ?table  and  ?names  to  recall  the 
functionality  of  these  functions.  Combine  them  in  an  intelligent  way. 

Exercise  3. 7  Consider  a  country  in  which  90  %  of  the  wealth  is  owned  by  20  %  of 
the  population,  the  so-called  upper  class.  For  simplicity,  let  us  assume  that  the  wealth 
is  distributed  equally  within  this  class. 

(a)  Draw  the  Lorenz  curve  for  this  country. 

(b)  Now  assume  a  revolution  takes  place  in  the  country  and  all  members  of  the  upper 
class  have  to  give  away  their  wealth  which  is  then  distributed  equally  across  the 
remaining  population.  Draw  the  Lorenz  curve  for  this  scenario. 

(c)  What  would  the  curve  from  (b)  look  like  if  the  entire  upper  class  left  the  country? 
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Exercise  3.8  A  bus  route  in  the  mountainous  regions  of  Romania  has  a  length  of 
418  km.  The  manager  of  the  bus  company  serving  the  route  wants  his  buses  to  finish 
a  trip  within  8  h.  The  bus  travels  the  first  1 80  km  with  an  average  speed  of  48  km/h,  the 
next  117  km  with  an  average  speed  of  37  km/h,  and  the  last  section  with  an  average 
speed  of  52  km/h. 

(a)  What  is  the  average  speed  with  which  the  bus  travels? 

(b)  Will  the  bus  finish  the  trip  in  time? 

Exercise  3.9  Four  friends  have  a  start-up  company  which  sells  vegan  ice  cream. 
Their  initial  financial  contributions  are  as  follows: 


Person 

12  3  4 

Contribution  (in  €) 

800  10300  4700  2220 

(a)  Calculate  and  draw  the  Lorenz  curve. 

(b)  Determine  and  interpret  the  standardized  Gini  coefficient. 

(c)  Does  G+  change  if  each  of  the  friends  contributes  only  half  the  amount  of 
money?  If  yes,  how  much?  If  no,  why  not? 

(d)  Use  R  to  draw  the  above  Lorenz  curve  and  to  calculate  the  Gini  coefficient. 

Exercise  3.10  Recall  the  pizza  delivery  data  which  is  described  in  Appendix  A.4. 

Use  R  to  read  in  and  analyse  the  data. 

(a)  Calculate  the  mean,  median,  minimum,  maximum,  first  quartile,  and  third  quar- 
tile  for  all  quantitative  variables. 

(b)  Determine  and  interpret  the  99  %  quantile  for  delivery  time  and  temperature. 

(c)  Write  a  function  which  calculates  the  absolute  mean  deviation.  Use  the  function 
to  calculate  the  absolute  mean  deviation  of  temperature. 

(d)  Scale  the  delivery  time  and  calculate  the  mean  and  variance  for  this  variable. 

(e)  Draw  a  box  plot  for  delivery  time  and  temperature.  The  box  plots  should  not 
highlight  extreme  values. 

(f)  Use  the  cut  command  to  create  a  new  variable  which  summarizes  delivery  time 
in  steps  of  10  min.  Calculate  the  arithmetic  mean  of  this  variable. 

(g)  Reproduce  the  QQ-plots  shown  in  Example  3.1.6. 


Solutions  to  all  exercises  in  this  chapter  can  be  found  on  p.  333 


*  Source  Toutenburg,  H.,  Heumann,  C.,  Deskriptive  Statistik ,  7th  edition,  2009, 
Springer,  Heidelberg 
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In  Chaps.  2  and  3  we  discussed  how  to  analyse  a  single  variable  using  graphs  and 
summary  statistics.  However,  in  many  situations  we  may  be  interested  in  the  interde¬ 
pendence  of  two  or  more  variables.  For  example,  suppose  we  want  to  know  whether 
male  and  female  students  in  a  college  have  any  preference  between  the  subjects 
mathematics  and  biology,  i.e.  if  there  is  any  evidence  that  male  students  prefer  math¬ 
ematics  over  biology  and  female  students  prefer  biology  over  mathematics  or  vice 
versa.  Suppose  we  choose  an  equal  number  of  male  and  female  students  and  ask  them 
about  their  preferred  subject.  We  expect  that  if  there  is  no  association  between  the 
two  variables  “gender  of  student”  (male  or  female)  and  “subject”  (mathematics  or 
biology),  then  an  equal  proportion  of  male  and  female  students  should  choose  the  sub¬ 
jects  biology  and  mathematics,  respectively.  Any  difference  in  the  proportions  may 
indicate  a  preference  of  males  or  females  for  a  particular  subject.  Similarly,  in  another 
example,  we  may  want  to  find  out  whether  female  employees  of  an  organization  are 
paid  less  than  male  employees  or  vice  versa.  Let  us  assume  again  that  we  choose 
an  equal  number  of  male  and  female  employees  and  assume  further  that  the  salary 
is  measured  as  a  binary  variable  (low-  versus  high-salary  group).  We  then  expect 
that  if  there  is  no  gender  discrimination,  the  number  of  male  and  female  employees 
in  the  lower-  and  higher- salary  groups  in  the  organization  should  be  approximately 
equal.  In  both  examples,  the  variables  considered  are  binary  and  nominal  (although 
the  salary  can  also  be  seen  as  ordinal)  and  the  data  is  summarized  in  terms  of  fre¬ 
quency  measures.  There  may,  however,  be  situations  in  which  we  are  interested  in 
associations  between  ordinal  or  continuous  variables.  Consider  a  data  set  in  which 
height,  weight,  and  age  of  infants  are  given.  Usually,  the  height  and  weight  of  infants 
increase  with  age.  Also,  the  height  of  infants  increases  with  their  weight  and  vice 
versa.  Clearly,  there  is  an  interrelation  or  association  among  the  three  variables.  In 
another  example,  two  persons  have  to  judge  participants  of  a  dance  competition  and 
rank  them  according  to  their  performance.  Now  if  we  want  to  learn  about  the  fairness 
in  the  judgment,  we  expect  that  both  the  judges  give  similar  ranks  to  each  candidate, 
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i.e.  both  judges  give  high  ranks  to  good  candidates  and  low  ranks  to  not  so  good 
candidates.  We  are  therefore  interested  in  studying  the  association  between  the  ranks 
given  by  the  two  judges.  In  all  these  examples,  the  intention  lies  in  measuring  the 
degree  of  association  between  two  (or  more)  variables.  We  therefore  need  to  study 
different  ways  of  measuring  the  association  level  for  different  types  of  variables.  In 
this  chapter,  we  present  measures  and  graphical  summaries  for  the  association  of 
two  variables — dependent  on  their  scale. 


4.1  Summarizing  the  Distribution  of  Two  Discrete  Variables 

When  both  variables  are  discrete,  then  it  is  possible  to  list  all  combinations  of  values 
of  the  two  variables  and  to  count  how  often  these  combinations  occur  in  the  data. 
Consider  the  salary  example  in  the  introduction  to  this  chapter  in  which  both  the 
variables  were  binary.  There  are  four  possible  combinations  of  variable  categories 
(female  and  low-salary  group,  female  and  high-salary  group,  male  and  low-salary 
group,  and  male  and  high-salary  group).  A  complete  description  of  the  joint  occur¬ 
rence  of  these  two  variables  can  be  given  by  counting,  for  each  combination,  the 
number  of  units  for  which  this  combination  is  measured.  In  the  following,  we  gen¬ 
eralize  this  concept  to  two  variables  where  each  can  have  an  arbitrary  (but  fixed) 
number  of  values  or  categories. 


4.1 .1  Contingency  Tables  for  Discrete  Data 

Suppose  we  have  data  on  two  discrete  variables.  This  data  can  be  described  in  a 
two-dimensional  contingency  table. 

Example  4.1.1  An  airline  conducts  a  customer  satisfaction  survey.  The  survey 
includes  questions  about  travel  class  and  satisfaction  levels  with  respect  to  different 
categories  such  as  seat  comfort,  in-flight  service,  meals,  safety,  and  other  indicators. 
Consider  the  information  on  X ,  denoting  the  travel  class  (Economy  =  “E”,  Business 
=  “B”,  First  =  “F”),  and  “Y”,  denoting  the  overall  satisfaction  with  the  flight  on  a 
scale  from  1  to  4  as  1  (poor),  2  (fair),  3  (good),  and  4  (very  good).  A  possible  response 
from  12  customers  may  look  as  follows: 


i 

Passenger  number 

1  2  3  4  5  6  7  8  9  10  11  12 

Travel  class 

Satisfaction 

EEEBEBFEEB  E  B 

2413124324  3  3 

We  can  calculate  the  absolute  frequencies  for  each  of  the  combination  of  observed 
values.  For  example,  there  are  2  passengers  (passenger  numbers  3  and  5)  who  were 
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Table  4.1  Contingency  table  for  travel  class  and  satisfaction 


Overall  rating  of  flight  quality 

Total  (row) 

Poor 

Fair 

Good 

Very  good 

Travel  class 

Economy 

2 

2 

2 

1 

7 

business 

0 

1 

2 

1 

4 

first 

0 

0 

0 

1 

1 

Total 

(column) 

2 

3 

4 

3 

12 

flying  in  economy  class  and  rated  the  flight  quality  as  poor,  there  were  no  passengers 
from  both  business  class  and  first  class  who  rated  the  flight  quality  as  poor;  there 
were  2  passengers  who  were  flying  in  economy  class  and  rated  the  quality  as  fair 
(2),  and  so  on.  Table 4.1  is  a  two-dimensional  table  summarizing  this  information. 

Note  that  we  not  only  summarize  the  joint  frequency  distribution  of  the  two 
variables  but  also  the  distributions  of  the  individual  variables.  Summing  up  the  rows 
and  columns  of  the  table  gives  the  respective  frequency  distributions.  For  example, 
the  last  column  of  the  table  demonstrates  that  7  passengers  were  flying  in  economy 
class,  4  passengers  were  flying  in  business  class  and  1  passenger  in  first  class. 

Now  we  extend  this  example  and  discuss  a  general  framework  to  summarize 
the  absolute  frequencies  of  two  discrete  variables  in  contingency  tables.  We  use 
the  following  notations:  Let  x\,  X2, . . . ,  be  the  k  classes  of  a  variable  X  and  let 
yi,  y2,  •  •  • ,  yi  be  the  /  classes  of  another  variable  Y .  We  assume  that  both  X  and  Y 
are  discrete  variables.  It  is  possible  to  summarize  the  absolute  frequencies  riij  related 
to  (xi,yj),i  =  1,  2,  . . . ,  k,  j  =  1,  2, ...,/,  in  a  k  x  /  contingency  table  as  shown 
in  Table 4.2. 

Table  4.2  k  x  /  contingency  table 


Y 

Vl 

yj 

yi 

Total 

(rows) 

vi 

nn 

•  •  • 

nij 

•  •  • 

nil 

n  i+ 

V2 

nil 

•  •  • 

n2j 

•  •  • 

n2l 

n2+ 

Xi 

nn 

mj 

. . . 

nn 

ni+ 

Xk 

nn 

.  .  . 

nkj 

.  .  . 

nkl 

nk+ 

Total 

(columns) 

n+ 1 

•  •  • 

n+j 

. . . 

n+i 

n 
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We  denote  the  sum  of  the  ith  row  as  rii+  =  Xij=\  nij  and  the  sum  over  the  7  th 


column  as  n+j  =  X/=i  nij •  The  total  number  of  observations  is  therefore 


k  l  k  l 

=  Z"+i  = 

i  =  1  7  =  1  i=l  7  =  1 


(4.1) 


Remark  4.1.1  Note  that  it  is  also  possible  to  use  the  relative  frequencies  fij  =  ntj  /n 
instead  of  the  absolute  frequencies  riij  in  Table  4.2,  see  Example  4.1.2. 


4.1 .2  Joint,  Marginal,  and  Conditional  Frequency  Distributions 

When  the  data  on  two  variables  are  summarized  in  a  contingency  table,  there  are 
several  concepts  which  can  help  us  in  studying  the  characteristics  of  the  data.  For 
example,  how  the  values  of  both  the  variables  behave  jointly,  how  the  values  of 
one  variable  behave  when  another  variable  is  kept  fixed  etc.  These  features  can  be 
studied  using  the  concepts  of  joint  frequency  distribution,  marginal  frequency  distri¬ 
bution,  and  conditional  frequency  distribution.  If  relative  frequency  is  used  instead 
of  absolute  frequency,  then  we  speak  of  the  joint  relative  frequency  distribution,  mar¬ 
ginal  relative  frequency  distribution,  and  conditional  relative  frequency  distribution. 

Definition  4.1.1  Using  the  notations  of  Table 4.2,  we  define  the  following: 


The  frequencies  ni  j  represent  the  joint  frequency  distribution  of  X  and  Y. 
The  frequencies  represent  the  marginal  frequency  distribution  of  X. 
The  frequencies  n+j  represent  the  marginal  frequency  distribution  of  Y . 

X\Y 

We  define  f^j  =  riij/n+j  to  be  the  conditional  frequency  distribution  of 
X  given  Y  =  yj. 

We  define  f-y  =  ri[j  /rii+  to  be  the  conditional  frequency  distribution  of  Y 
given  X  =  Xi 


The  frequencies  fij  represent  the  joint  relative  frequency  distribution  of  X 
and  Y . 

The  frequencies  //+  =  Xy=i  fij  represent  the  marginal  relative  frequency 
distribution  of  X. 


The  frequencies  /+7  =  X/=i  fij  represent  the  marginal  relative  frequency 
distribution  of  Y. 

X\Y 

We  define  /•  | .  =  fj  /f+j  to  be  the  conditional  relative  frequency  distrib¬ 

ution  of  X  given  Y  =  yj. 


Y\X 

We  define  f-y  =  f)j  /f  +  to  be  the  conditional  relative  frequency  distribu¬ 


tion  of  Y  given  X  =  xL . 
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Table  4.3  Contingency  table  for  travel  class  and  satisfaction 


Overall  rating  of  flight  quality 

Poor 

Fair 

Good 

Very  good 

Total  (rows) 

Travel  class 

Economy 

10 

33 

15 

4 

62 

Business 

0 

3 

20 

2 

25 

First 

0 

0 

5 

8 

13 

Total 

(columns) 

10 

36 

40 

14 

100 

Note  that  for  a  bivariate  joint  frequency  distribution,  there  will  only  be  two  mar¬ 
ginal  (or  relative)  frequency  distributions  but  possibly  more  than  two  conditional  (or 
relative)  frequency  distributions. 

Example  4.1.2  Recall  the  setup  of  Example  4.1.1.  We  now  collect  and  evaluate  the 
responses  of  100  customers  (instead  of  12  passengers  as  in  Example  4.1.1)  regarding 
their  choice  of  the  travel  class  and  their  overall  satisfaction  with  the  flight  quality. 

The  data  is  provided  in  Table  4.3  where  each  of  the  cell  entries  illustrates  how 
many  out  of  100  passengers  answered  jc/  and  yj  \  for  example,  the  first  entry  “10” 
indicates  that  10  passengers  were  flying  in  economy  class  and  described  the  overall 
service  quality  as  poor. 

•  The  marginal  frequency  distributions  are  displayed  in  the  last  column  and  last  row, 
respectively.  For  example,  the  marginal  distribution  of  X  refers  to  the  frequency 
table  of  “travel  class”  (X)  and  tells  us  that  62  passengers  were  flying  in  economy 
class,  25  in  business  class,  and  13  in  first  class.  Similarly,  the  marginal  distribution 
of  “overall  rating  of  flight  quality”  (F)  tells  us  that  10  passengers  rated  the  quality 
as  poor,  36  as  fair,  40  as  good,  and  14  as  very  good. 

•  The  conditional  frequency  distributions  give  us  an  idea  about  the  behaviour  of  one 
variable  when  the  other  one  is  kept  fixed.  For  example,  the  conditional  distribution 
of  the  “overall  rating  of  flight  quality”  (F)  among  passengers  who  were  flying 

in  economy  class  (/F|x=Economy)  gives  /t| J  =  10/62  16  %  which  means  that 

approximately  16  %  of  the  customers  in  economy  class  are  rating  the  quality  as 

Y\X 

poor,  /2|{  =  33/62  53  %  of  the  customers  in  economy  class  are  rating  the 

Y\X 

quality  as  fair,  /3 1 J  =15/62  %  24  %  of  the  customers  in  economy  class  are  rating 

Y\X 

the  quality  as  good  and  /4|j  =  4/62  ^  7  %  of  the  customers  in  economy  class 

are  rating  the  quality  as  very  good.  Similarly,  /3  ^  =  20/25  ~  80  %  which  means 
that  80  %  of  the  customers  in  business  class  are  rating  the  quality  as  good  and 
so  on. 

•  The  conditional  frequency  distribution  of  the  “travel  class”  (X)  of  passengers 

given  the  “overall  rating  of  flight  quality”  (F)  is  obtained  by  /x|F=Satisfaction  level- 

X\Y 

For  example,  fx\Y=good  gives  =  15/40  =  37.5  %  which  means  that  37.5  % 
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of  the  passengers  who  rated  the  flight  to  be  good  travelled  in  economy  class, 

X\Y 

f2 1 3  =  20/40  =  50  %  of  the  passengers  who  rated  the  flight  to  be  good  travelled 

X\Y 

in  business  class  and  /3  ^  =5 /40  =  12.5  %  of  the  passengers  who  rated  the  flight 
to  be  good  travelled  in  first  class. 

•  In  total,  we  have  100  customers  and  hence 

k  i 

n  =  ni+  =  62  +  25  +  13  =  n+j  =  10  +  36  +  40+  14 
i= 1  j= i 

k  i 

=  ^  =  10  +  33  +  15  +  4  +  +3  +  20  +  2  +  5  +  8  =  100 

i= 1 j= 1 

•  Alternatively,  we  can  summarize  X  and  Y  using  the  relative  frequencies  as  follows: 


Travel  class 

Overall  rating  of  flight  quality 

Poor  Fair  Good  Very  good 

Total  (rows) 

Economy 

Business 

First 

10  33  15  4 

100  100  100  100 

n  3  20  2 

u  100  100  100 

0  0  5  8 

u  u  100  100 

62 

100 

25 

100 

13 

100 

Total  (columns) 

10  36  40  14 

100  100  100  100 

1 

To  produce  the  frequency  table  without  the  marginal  distributions,  we  can  use 
the  R  command  table(X,Y).  To  obtain  the  full  contingency  table  including  the 
marginal  distributions  in  R ,  one  can  use  the  function  addmargins  ( ) .  For  the  relative 
frequencies,  the  function  prop .  table  ( )  can  be  used.  In  summary,  a  full  contingency 
table  is  obtained  by  using 

addmargins (table (X,Y) ) 
addmargins (prop . table (table (X , Y) ) ) 


4.1 .3  Graphical  Representation  of  Two  Nominal 
or  Ordinal  Variables 

Bar  charts  (see  Sect.  2.3.1)  can  be  used  to  graphically  summarize  the  association 
between  two  nominal  or  two  ordinal  variables.  The  bar  chart  is  drawn  for  X  and  the 
categories  of  Y  are  represented  by  separated  bars  or  stacked  bars  for  each  category 
of  X.  In  this  way,  we  summarize  the  joint  distribution  of  the  contingency  table. 

Example  4.1.3  Consider  Example  4.1.2.  There  are  62  passengers  flying  in  the  econ¬ 
omy  class.  From  these  62  passengers,  10  rated  the  quality  of  the  flight  as  poor,  33  as 
fair,  15  as  good,  and  4  as  very  good.  This  means  for  X  =  x\(=  Economy),  we  can 
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Rating  Rating 


3=good 

1=poor 

3=good 

4=very  good 

2=fair 

4=very  good 

Class  Class 

(a)  Categories  next  to  each  other  (b)  Categories  stacked 


Fig.  4.1  Bar  charts  for  travel  class  and  rating  of  satisfaction 


either  place  4  bars  next  to  each  other,  as  in  Fig.  4. la,  or  we  can  stack  them  on  top 
of  each  other,  as  in  Fig.  4. lb.  The  same  can  be  done  for  the  other  categories  of  X , 
see  Fig.  4.1.  Stacked  and  stratified  bar  charts  are  prepared  in  R  by  calling  the  library 
lattice  and  using  the  function  bar  chart.  In  detail,  one  needs  to  specify: 

Class  <-  c (rep('l:  Economy 1 , 62) ,rep( 1 2 :  Business 25) , 
rep('3:  First1 ,13)) 

Rating  <-  c(rep( 1 l=poor 1 , 10) ,  rep( 1 2=f air  1 ,33) , . . . ) 
library (lattice) 

barchart (table (Class , Rating) ,horizontal=FALSE, stack=FALSE) 
barchart (table (Class , Rating) , horizontal=FALSE , stack=TRUE) 


Remark  4.1.2  There  are  several  other  options  in  R  to  specify  stratified  bar  charts. 
We  refer  the  interested  reader  to  Exercise  2.6  to  explore  how  the  R  package  ggplot2 
can  be  used  to  make  such  graphics.  Sometimes  it  can  also  be  useful  to  visualize  the 
difference  of  two  variables  and  not  stack  or  stratify  the  bars,  see  Exercise  2.1. 

Independence  and  Expected  Frequencies  An  important  statistical  concept  is  inde¬ 
pendence.  In  this  section,  we  touch  upon  its  descriptive  aspects,  see  Chaps.  6 
(Sect.  6.5)  and  7  (Sect.  7.5)  for  more  theoretical  details.  Two  variables  are  considered 
to  be  independent  if  the  observations  on  one  variable  do  not  influence  the  observa¬ 
tions  on  the  other  variable.  For  example,  suppose  two  different  persons  roll  a  die 
separately;  then,  the  outcomes  of  their  rolls  do  not  depend  on  each  other.  So  we 
can  say  that  the  two  observations  are  independent.  In  the  context  of  contingency 
tables,  two  variables  are  independent  of  each  other  when  the  joint  relative  frequency 
equals  the  product  of  the  marginal  relative  frequencies  of  the  two  variables,  i.e.  the 
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Table  4.4  Observed  and  expected  absolute  frequencies  for  the  airline  survey 


Overall  rating  of  flight  quality 


Poor 

Fair 

Good 

Very  good 

Total 

Travel 

Economy 

10  (6.2) 

33  (22.32) 

15  (24.8) 

4  (8.68) 

62 

class 

Business 

0  (2.5) 

3  (9.0) 

20(10.0) 

2  (3.5) 

25 

First 

0(1.3) 

0  (4.68) 

5  (5.2) 

8  (1.82) 

13 

Total 

10 

36 

40 

14 

100 

following  equation  holds: 

fij  =  fi+f+j  •  (4.2) 

The  expected  absolute  frequencies  under  independence  are  obtained  by 

r  ni  +  n+j  ni+n+j 

nij  =  n  fij  =  n - =  -  •  (4.3) 

n  n  n 

Note  that  the  absolute  frequencies  are  always  integers  but  the  expected  absolute 
frequencies  may  not  always  be  integers. 


Example  4.1.4  Recall  Example  4.1.2.  The  expected  absolute  frequencies  for  the 
contingency  table  can  be  calculated  using  (4.3).  For  example, 

62  •  10  ^  „  62  •  36 

n\\  =  - =  6.2,  n \?  = - =  22.32  etc. 

100  100 

Table  4.4  lists  both  the  observed  absolute  frequency  and  expected  absolute  frequency 
(in  brackets). 

To  calculate  the  expected  absolute  frequencies  in  R,  we  can  access  the  “expected” 
object  returned  from  a  x2-test  applied  to  the  respective  contingency  table  as  follows: 

chisq. test (table (Class , Rating) ) $expected 

A  detailed  motivation  and  explanation  of  this  command  is  given  in  Sect.  10.8. 


4.2  Measures  of  Association  for  Two  Discrete  Variables 

When  two  variables  are  not  independent,  then  they  are  associated.  Their  association 
can  be  weak  or  strong.  Now  we  describe  some  popular  measures  of  association. 
Measures  of  association  describe  the  degree  of  association  between  two  variables 
and  can  have  a  direction  as  well.  Note  that  if  variables  are  defined  on  a  nominal  scale, 
then  nothing  can  be  said  about  the  direction  of  association,  only  about  the  strength. 

Let  us  first  consider  a  2  x  2  contingency  table  which  is  a  special  case  of  a  k  x  l 
contingency  table,  see  Table 4.5. 
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Table  4.5  2  x  2  contingency  table 


Y 

VI 

yi 

Total  (row) 

XI 

a 

b 

a  +  b 

*2 

c 

d 

c  +  d 

Total  (column) 

a  +  c 

b  +  d 

n 

Table  4.6  2  x  2  contingency  table 


Persons 

Not  affected 

Affected 

Total  (row) 

Vaccinated 

90 

10 

100 

Vaccination 

Not  vaccinated 

40 

60 

100 

Total  (column) 

130 

70 

200 

The  variables  X  and  Y  are  independent  if 

a  b  a  +  b 


a  +  c  b  +  d 


n 


(4.4) 


or  equivalently  if 


a  = 


(< a  +  b)  (a  +  c) 


n 


(4.5) 


Note  that  some  other  forms  of  the  conditions  (4.4)-(4.5)  can  also  be  derived  in 
terms  of  a ,  b ,  c,  and  d. 


Example  4.2. 1  Suppose  a  vaccination  against  flu  (influenza)  is  given  to  200  persons. 
Some  of  the  persons  may  get  affected  by  flu  despite  the  vaccination.  The  data  is 
summarized  in  Table  4.6.  Using  the  notations  of  Table 4.5,  we  have  a  =  90,  b  = 
10,  c  =  40,  d  =  60,  and  thus,  ( a  -\-b)(a  -\-  c)/n  =  100  •  130/200  =  65  which  is  less 
than  a  =  90.  Hence,  being  affected  by  flu  is  not  independent  of  the  vaccination, 
i.e.  whether  one  is  vaccinated  or  not  has  an  influence  on  getting  affected  by  flu. 
In  the  vaccinated  group,  only  10  of  100  persons  are  affected  by  flu  while  in  the 
group  not  vaccinated  60  of  100  persons  are  affected.  Another  interpretation  is  that 
if  independence  holds,  then  we  would  expect  65  persons  to  be  not  affected  by  flu  in 
the  vaccinated  group  but  we  observe  90  persons.  This  shows  that  vaccination  has  a 
protective  effect. 

To  gain  a  better  understanding  about  the  strength  of  association  between  two 
variables,  we  need  to  develop  the  concept  of  dependence  and  independence  further. 
The  following  three  subsections  illustrate  this  in  more  detail. 
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4.2.1  Pearson's  %2  Statistic 


We  now  introduce  Pearson’s  x2  statistic  which  is  used  for  measuring  the  association 
between  variables  in  a  contingency  table  and  plays  an  important  role  in  the  construc¬ 
tion  of  statistical  tests,  see  Sect.  10.8.  The  x2  statistic  or  x2  coefficient  for  a  k  x  l 
contingency  table  is  given  as 


k  l  /  ~  \2 

2  _  V  Hi  niJ) 

X 


i  =  l  7  =  1 


nij 


ij 


ni+n+j_y 


n 


i  =  1  7  =  1 


ni+n+j 


(4.6) 


ft 


A  simpler  formula  for  2  x  2  contingency  tables  is 

2  n{ad  —  be)2 

^  (a  +  Z?)(c  +  d)(ft  +  c)(Z?  +  J)  ^  ^ 

The  idea  behind  the  x2  coefficient  is  that  when  the  relationship  between  two 
variables  is  stronger,  then  the  deviations  between  observed  and  expected  frequencies 
are  expected  to  be  higher  (because  the  expected  frequencies  are  calculated  assuming 
independence)  and  this  indicates  a  stronger  relationship  between  the  two  variables.  If 
observed  and  expected  frequencies  are  identical  or  similar,  then  this  is  an  indication 
that  the  association  between  the  two  variables  is  weak  and  the  variables  may  even  be 
independent.  The  x2  statistic  for  a  k  x  /  contingency  table  sums  up  all  the  differences 
between  the  observed  and  expected  frequencies,  squares  them,  and  scales  them  with 
respect  to  the  expected  frequencies.  The  squaring  of  the  difference  makes  the  statistic 
independent  of  the  positive  and  negative  signs  of  the  difference  between  observed 
and  expected  frequencies.  The  range  of  values  for  x2  is 

0  <  x2  5  «( min(k,  /)  -  1).  (4.8) 


Note  that  min(k,  /)  is  the  minimum  function  and  simply  returns  the  smaller  of  the 
two  numbers  k  and  /.  For  example,  min (3,  4)  returns  the  value  3.  Consequently  the 
values  of  x2  obtained  from  (4.6)  can  be  compared  with  the  range  from  (4.8).  A 
value  of  x2  close  to  zero  indicates  a  weak  association  and  a  value  of  x2  close  to 
ft  (min (k,  /)  —  1)  indicates  a  strong  association  between  the  two  variables.  Note  that 
the  range  of  x2  depends  on  ft,  k  and  /,  i.e.  the  sample  size  and  the  dimension  of  the 
contingency  table. 

The  x2  statistic  is  a  symmetric  measure  in  the  sense  that  its  value  does  not  depend 
on  which  variable  is  defined  as  X  and  which  as  Y. 


Example  4.2.2  Consider  Examples  4.1.2  and  4.1.4.  Using  the  values  from  Table 4.4, 
we  can  calculate  the  x2  statistic  as 


?  (10  -  6.2)' 
X  =  - 


6.2 


+ 


(33  -  22.32)' 
22.32 


+  •••  + 


(8  -  1.82)' 
1.82 


=  57.95064 


The  maximum  possible  value  for  the  x2  statistic  is  100(min(4,  3)  —  1)  =  200.  Thus, 
X2  ~  57  indicates  a  moderate  association  between  “travel  class”  and  “overall  rating 
of  flight  quality”  of  the  passengers.  In  R,  we  obtain  this  result  as  follows: 


chi sq. test (table (Class , Rating) ) $statistic 
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4.2.2  Cramer's  V  Statistic 


A  problem  with  Pearson’s  x2  coefficient  is  that  the  range  of  its  maximum  value 
depends  on  the  sample  size  and  the  size  of  the  contingency  table.  These  values 
may  vary  in  different  situations.  To  overcome  this  problem,  the  coefficient  can  be 
standardized  to  lie  between  0  and  1  so  that  it  is  independent  of  the  sample  size  as  well 
as  the  dimension  of  the  contingency  table.  Since  ft  (min  (k,  /)  —  1)  was  the  maximal 
value  of  the  x2  statistic,  dividing  x2  by  this  maximal  value  automatically  leads  to  a 
scaled  version  with  maximal  value  1.  This  idea  is  used  by  Cramer’s  V  statistic  which 
for  a  k  x  /  contingency  table  is  given  by 


X^ 


ft  (min  (k,  l) 


(4-9) 


The  closer  the  value  of  V  gets  to  1,  the  stronger  the  association  between  the  two 
variables. 


Example  4.2.3  Consider  Example  4.2.2.  The  obtained  x2  statistic  is  57.95064.  To 
obtain  Cramer’s  V,  we  just  need  to  calculate 


ft  (min  (k,  /)  —  1) 

This  indicates  a  moderate  association  between  “travel  class”  and  “overall  rating  of 
flight  quality”  because  0.54  lies  in  the  middle  of  0  and  1.  In  R,  there  are  two  options 
to  calculate  V :  (i)  to  calculate  the  x2  statistic  and  then  adjust  it  as  in  (4.9),  (ii)  to  use 
the  functions  assoc  stats  and  xtabs  contained  in  the  package  vcd  as  follows: 

library (vcd) 

assocstats  (xtabs  (^Class+Rating) ) 


57.95064 
100(3  -  1) 


0.54. 


(4.10) 


4.2.3  Contingency  Coefficient  C 


Another  option  to  standardize  x2  is  given  by  a  corrected  version  of  Pearson’s  con¬ 
tingency  coefficient: 


with 


Ccorr  — 


c 


c 


max 


/  min (k,  l) 

/  X2 

V  min (k,  /)  —  1  \ 

X2  +n 

c  = 


x^ 


X  +n 


and  Cmax  — 


min (k,  l)  —  1 
min (k,  l) 


(4.11) 


(4.12) 


It  always  lies  between  0  and  1.  The  closer  the  value  of  C  is  to  1,  the  stronger  the 
association. 
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Example  4.2.4  We  know  from  Example  4.2.2  that  the  \2  statistic  for  travel  class  and 
satisfaction  level  is  57.95064.  To  calculate  CCOrr ,  we  need  the  following  calculations: 


C  = 


Ccorr  — 


57.95064 


57.95064  +  100 


—  0.606,  Cm  ax  — 


min(4,  3)  —  1 
min (4,  3) 


C 


C 


max 


0.606 

0.816 


0.74 


0.816, 


There  is  a  moderate  to  strong  association  between  “travel  class”  and  “overall  rating 
of  flight  quality”  of  the  passengers.  We  can  compute  C  in  R  using  the  vcd  package 
as  follows: 


library (vcd) 

Cmax  =  sqrt((min(c(3,4))-l)/min(c(3,4))) 
assocstats  (xtabs  (^Class+Rating) )  $cont/Cmax 


4.2.4  Relative  Risks  and  Odds  Ratios 


We  now  introduce  the  concepts  of  odds  ratios  and  relative  risks.  Consider  a  2  x  2 
contingency  table  as  introduced  in  Table 4.5.  Now  suppose  we  have  two  variables 
X  and  Y  with  their  conditional  distributions  fX) Y  and  In  the  context  of  a 

2  x  2  contingency  table,  /jfj  =  nn/n+i,  ffyY  =  nn/n+2,  f^2  =  «22/«+2,and 
X\Y 

f 2 1  j  =  «2i/fl+i-  The  relative  risks  are  defined  as  the  ratio  of  two  conditional  dis¬ 
tributions,  for  example 


fX\Y 

7111 


nw/n+i  a/(a  +  c ) 


and 


fX\Y 
7  2|  1 


n2\/n+\  c/(a  +  c) 


f^Y  nn/n+2  b/(b  +  d )  fWy  «22/«+2  d/(b  +  d ) 

The  odds  ratio  is  defined  as  the  ratio  of  these  relative  risks  from  (4.13)  as 


(4.13) 


OR  = 


rX\Y  ,rX\Y 

71 1 1  /7  i|2 
fX\Y  /fX\Y 
7  2|  1  /  7  2|2 


fX\Y  ,X\Y 
7 i|i  J 2\2  _  ad 

rX\Y  ,X\Y  ~ 

72|1  7 1|2 


(4.14) 


Alternatively,  the  odds  ratio  can  be  defined  as  the  ratio  of  the  chances  for  “disease”, 
a/b  (number  of  smokers  with  the  disease  divided  by  the  number  of  non-smokers 
with  the  disease),  and  no  disease,  c/d  (number  of  smokers  with  no  disease  divided 
by  the  number  of  non-smokers  with  no  disease). 

The  relative  risks  compare  proportions,  while  the  odds  ratio  compares  odds. 
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Example  4.2.5  A  classical  example  refers  to  the  possible  association  of  smoking 
with  a  particular  disease.  Consider  the  following  data  on  240  individuals: 


Smoking 

Yes 

No 

Total  (row) 

Disease  Yes 

34 

66 

100 

No 

22 

118 

140 

Total  (column) 

56 

184 

240 

We  calculate  the  following  relative  risks: 


rX\Y 

Jl\l 

fX\Y 

J  1\2 


34/56 

66/184 


1.69 


and 


rX\Y 

J2\l 

fX\Y 

J  2|2 


22/56 

118/184 


**0.61. 


(4.15) 


Thus,  the  proportion  of  individuals  with  the  disease  is  1.69  times  higher  among 
smokers  when  compared  with  non-smokers.  Similarly,  the  proportion  of  healthy 
individuals  is  0.61  times  smaller  among  smokers  when  compared  with  non-smokers. 

The  relative  risks  are  calculated  to  compare  the  proportion  of  sick  or  healthy 
patients  between  smokers  and  non-smokers.  Using  these  two  relative  risks,  the  odds 
ratio  is  obtained  as 


OR 


34  x  118 
66  x  22 


2.76. 


We  can  interpret  this  outcome  as  follows:  (i)  the  chances  of  smoking  are  2.76  times 
higher  for  individuals  with  the  disease  compared  with  healthy  individuals  (follows 
from  definition  (4.14)).  We  can  also  say  that  (ii)  the  chances  of  having  the  particular 
disease  is  2.76  times  higher  for  smokers  compared  with  non-smokers.  If  we  inter¬ 
change  either  one  of  the  “Yes”  and  “No”  columns  or  the  “Yes”  and  “No”  rows,  we 
obtain  OR  =  1/2.76  **  0.36,  giving  us  further  interpretations:  (iii)  the  chances  of 
smoking  are  0.36  times  lower  for  individuals  without  disease  compared  with  indi¬ 
viduals  with  the  disease,  and  (iv)  the  chance  of  having  the  particular  disease  is  0.36 
times  lower  for  non-smokers  compared  with  smokers.  Note  that  all  four  interpreta¬ 
tions  are  correct  and  one  needs  to  choose  the  right  interpretation  in  the  light  of  the 
experimental  situation  and  the  question  of  interest. 


4.3  Association  Between  Ordinal  and  Continuous  Variables 

4.3.1  Graphical  Representation  of  Two  Continuous  Variables 

A  simple  way  to  graphically  summarize  the  association  between  two  continuous 
variables  is  to  plot  the  paired  observations  of  the  two  variables  in  a  two-dimensional 
coordinate  system.  If  n  paired  observations  for  two  continuous  variables  X  and  Y 
are  available  as  (v/,  y/),  i  =  1,2 , ...  ,n,  then  all  such  observations  can  be  plotted 
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^Bravais-Pearson  0.91,  ^Spearman  0.87 
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(c)  Moderate  positive  relationship 
Fig.  4.2  Scatter  plots 


X 

(d)  Moderate  negative  relationship 


in  a  single  graph.  This  graph  is  called  a  scatter  plot.  Such  a  plot  reveals  possible 
relationships  and  trends  between  the  two  variables.  For  example,  Figs.  4.2  and  4.3 
show  scatter  plots  with  six  different  types  of  association. 

•  Figure  4.2a  shows  increasing  values  of  Y  for  increasing  values  of  X.  We  call  this 
relationship  positive  association.  The  relationship  between  X  and  Y  is  nearly  linear 
because  all  the  points  lie  around  a  straight  line. 

•  Figure  4.2b  shows  decreasing  values  of  Y  for  increasing  values  of  X.  We  call  this 
relationship  negative  association. 

•  Figure  4.2c  tells  us  the  same  as  Fig.  4.2a,  except  that  the  positive  association  is 
weaker. 
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(a)  No  clear  relationship 
Fig.  4.3  Continues  Fig.  4.2 — more  scatter  plots 


X 

(b)  Nonlinear  relationship 


•  Figure  4. 2d  tells  us  the  same  as  Fig.  4.2b,  except  that  the  negative  association  is 
weaker. 

•  Figure  4.3a  shows  that  as  the  X-values  increase,  the  values  of  Y  neither  increase 
nor  decrease.  This  indicates  that  there  is  no  clear  relationship  between  X  and  Y 
and  highlights  the  lack  of  association  between  X  and  Y . 

•  Figure  4.3b  illustrates  a  nonlinear  relationship  between  X-  and  F-values. 

Example  4.3.1  To  explore  the  possible  relationship  between  the  overall  number  of 
tweets  with  the  number  of  followers  on  Twitter,  we  take  a  sample  of  10  prime 
ministers  and  heads  of  state  in  different  countries  as  of  June  2014  and  obtain  the 
following  data: 


Name 

Tweets 

Followers 

Angela  Merkel 

25 

7194 

Barack  Obama 

11,800 

43,400,000 

Jacob  Zuma 

99 

324,000 

Dilma  Rousseff 

1934 

2,330,000 

Sauli  Niinisto 

199 

39,000 

Vladimir  Putin 

2539 

189,000 

Francois  Hollande 

4334 

639,000 

David  Cameron 

952 

688,000 

Enrique  P.  Nieto 

3245 

2,690,000 

John  Key 

2468 

110,000 

The  tweets  are  denoted  by  Xi  and  the  followers  are  denoted  by  y* ,  i  =  1 ,  2, . . . ,  10. 
We  plot  paired  observations  (v; ,  y* )  into  a  cartesian  coordinate  system.  For  example, 
we  plot  (vi,  y  i)  =  (25,  7194)  for  Angela  Merkel,  (x2,  yi)  =  (11,  800,  43,  400,  000) 
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Fig.  4.4  Scatter  plot 
between  tweets  and  followers 
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for  Barack  Obama,  and  so  on.  Figure  4.4  shows  the  scatter  plot  for  the  number  of 
tweets  and  the  number  of  followers  (on  a  log- scale). 

One  can  see  that  there  is  a  positive  association  between  the  number  of  tweets  and 
the  number  of  followers.  This  does,  however,  not  imply  a  causal  relationship:  it  is 
not  necessarily  because  someone  tweets  more  he/she  has  more  followers  or  because 
someone  has  more  followers  he/she  tweets  more;  the  scatter  plot  just  describes  that 
those  with  more  tweets  have  more  followers.  In  R,  we  produce  this  scatter  plot  by 
the  plot  command: 

tweets  <-  c (25, 11800,99, ... ) 
followers  <-  c (7194,43400000 ,... ) 
plot (tweets , followers) 


4.3.2  Correlation  Coefficient 


Suppose  two  variables  X  and  Y  are  measured  on  a  continuous  scale  and  are  linearly 
related  like  Y  =  a  +  b  X  where  a  and  b  are  constant  values.  The  correlation  coef¬ 
ficient  r(X ,  Y)  =  r  measures  the  degree  of  linear  relationship  between  X  and  Y 
using 


with 


r  _  -  x)(yj  -  y)  _  Sxy 

y/ZUiti  -  &  •  T.U(yi  -  9)2 

n  n 

Sxx  =  X(x‘  -  =  syy  =  -  9 )2  =  n*Y’ 

i= 1  i  =  1 


(4.16) 


(4.17) 


and 

n  n 

Sxy  =  XjXi  ~  x)(y>  ~y)  =  X  X‘ yi  - nxy  ■ 
1  =  1  1  =  1 


(4.18) 
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Karl  Pearson  (1857-1936)  presented  the  first  rigorous  treatment  of  correlation  and 
acknowledged  Auguste  Bravais  (1811-1863)  for  ascertaining  the  initial  mathemati¬ 
cal  formulae  for  correlation.  This  is  why  the  correlation  coefficient  is  also  known  as 
the  Bravais-Pearson  correlation  coefficient. 

The  correlation  coefficient  is  independent  of  the  units  of  measurement  of  X  and 
Y.  For  example,  if  someone  measures  the  height  and  weight  in  metres  and  kilograms 
respectively  and  another  person  measures  them  in  centimetres  and  grams,  respec¬ 
tively,  then  the  correlation  coefficient  between  the  two  sets  of  data  will  be  the  same. 
The  correlation  coefficient  is  symmetric,  i.e.  r(X,  Y)  =  r(Y,  X).  The  limits  of  r  are 
—  1  <  r  <  1 .  If  all  the  points  in  a  scatter  plot  lie  exactly  on  a  straight  line,  then  the 
linear  relationship  between  X  and  Y  is  perfect  and  \r\  =  1,  see  also  Exercise  4.7.  If 
the  relationship  between  X  and  Y  is  (i)  perfectly  linear  and  increasing,  then  r  =  +1 
and  (ii)  perfectly  linear  and  decreasing,  then  r  —  —  1.  The  signs  of  r  thus  determine 
the  direction  of  the  association.  If  r  is  close  to  zero,  then  it  indicates  that  the  variables 
are  independent  or  the  relationship  is  not  linear.  Note  that  if  the  relationship  between 
X  and  Y  is  nonlinear,  then  the  degree  of  linear  relationship  may  be  low  and  r  is  then 
close  to  zero  even  if  the  variables  are  clearly  not  independent.  Note  that  r  (X,  X)  =  1 
and  r(X,  -X)  =  -1. 

Example  4.3.2  Look  again  at  the  scatter  plots  in  Figs.  4.2  and  4.3.  We  observe  strong 
positive  linear  correlation  in  Fig.  4.2a  (r  =  0.91),  strong  negative  linear  correlation 
in  Fig.  4.2b  (r  =  —0.92),  moderate  positive  linear  correlation  in  Fig.  4.2c  (r  =  0.50), 
moderate  negative  linear  association  in  Fig.  4. 2d  (r  =  —0.56),  no  visible  correlation 
in  Fig.  4.3a  (r  =  0.03),  and  strong  nonlinear  (but  not  so  strong  linear)  correlation  in 
Fig.  4.3b  (r  =  0.64). 

Example  4.3.3  In  a  decathlon  competition,  a  group  of  athletes  are  competing  with 
each  other  in  10  different  track  and  field  events.  Suppose  we  are  interested  in  how 
the  results  of  the  100-m  race  relate  to  the  results  of  the  long  jump  competition.  The 
correlation  coefficient  for  the  100-m  race  (X,  in  seconds)  and  the  long  jump  event 
(T,  in  metres)  for  5  athletes  participating  in  the  2004  Olympic  Games  (see  also 
Appendix  A. 4)  are  listed  in  Table  4.7. 

To  calculate  the  correlation  coefficient,  we  need  the  following  summary  statistics: 

1 

x  =  -(10.85  +  10.44  +  10.50  +  10.89  +  10.62)  =  10.66 
1 

y  =  -(7.84  +  7.96  +  7.81  +  7.47  +  7.74)  =  7.764 
Sxx  =  (10.85  -  10.66)2  +  (10.44  -  10.66)2  +  •  •  •  +  (10.62  -  10.66)2  =  0.1646 


Table  4.7  Results  of  100-m 

l 

Xi 

yt 

idvc  cuiu  luiig  jump  ui  o 

athletes  Roman  Sebrle 

10.85 

7.84 

Bryan  Clay 

10.44 

7.96 

Dmitriy  Karpov 

10.50 

7.81 

Dean  Macey 

10.89 

7.47 

Chiel  Warners 

10.62 

7.74 

84 


4  Association  of  Two  Variables 


Syy  =  (7.84  -  7.764)2  +  (7.96  -  7.764)2  +  •  •  •  +  (7.74  -  7.764)2  =  0.13332 
Sxy  =  (10.85  -  10.66) (7.84  -  7.764)  +  •  •  •  +  (10.62  -  10.66)(7.74  -  7.764) 
=  -0.1027 


The  correlation  coefficient  therefore  is 


r  = 


-0.1027 

V0. 1646  x  0.13332 


-0.69. 


Since  —0.69  is  negative,  we  can  say  that  (i)  there  is  a  negative  correlation  between 
the  100-m  race  and  the  long  jump  event,  i.e.,  shorter  running  times  result  in  longer 
long  jump  results,  and  (ii)  this  association  is  moderate  to  strong. 

In  R,  we  can  obtain  the  results  (after  attaching  the  data)  as  follows: 


cor (X . 100m , X . Long . j  ump ,  method= ' pear son ' ) 


4.3.3  Spearman's  Rank  Correlation  Coefficient 


Consider  a  situation  where  n  objects  are  ranked  with  respect  to  two  variables  X  and 
Y .  For  instance,  the  variables  could  represent  the  opinion  of  two  different  judges 
in  a  talent  competition  who  rank  the  participants  with  respect  to  their  performance. 
This  means  that  for  each  judge,  the  worst  participant  (with  the  lowest  score  Xi)  is 
assigned  rank  1,  the  second  worst  participant  (with  the  second  lowest  score  Xi)  will 
receive  rank  2,  and  so  on.  Thus,  every  participant  has  been  given  two  ranks  by  two 
different  judges.  Suppose  we  want  to  measure  the  degree  of  association  between  the 
two  different  judgments;  that  is,  the  two  different  sets  of  ranks.  We  expect  that  under 
perfect  agreement,  both  the  judges  give  the  same  judgment  in  the  sense  that  they  give 
the  same  ranks  to  each  candidate.  However,  if  they  are  not  in  perfect  agreement,  then 
there  may  be  some  variation  in  the  ranks  assigned  by  them.  To  measure  the  degree  of 
agreement,  or,  in  general,  the  degree  of  association,  one  can  use  Spearman’s  rank 
correlation  coefficient.  As  the  name  says,  this  correlation  coefficient  uses  only  the 
ranks  of  the  values  and  not  the  values  themselves.  Thus,  this  measure  is  suitable 
for  both  ordinal  and  continuous  variables.  We  introduce  the  following  notations: 
let  R(xi)  denote  the  rank  of  the  i th  observation  on  X,  i.e.  the  rank  jt*  among  the 
ordered  values  of  X.  Similarly,  R(yt)  denotes  the  rank  of  the  i th  observation  of  y. 
The  difference  between  the  two  rank  values  is  di  =  R(xi)  —  R (y/).  Spearman’s  rank 
correlation  coefficient  is  defined  as 


(4.19) 


The  values  of  R  lie  between  —1  and  +1  and  measure  the  degree  of  correlation 
between  the  ranks  of  X  and  Y .  Note  that  it  does  not  matter  whether  we  choose  an 
ascending  or  descending  order  of  the  ranks,  the  value  of  R  remains  the  same.  When 
all  the  observations  are  assigned  exactly  the  same  ranks,  then  R  =  1  and  when  all 
the  observations  are  assigned  exactly  the  opposite  ranks,  then  R  =  —  1. 
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Example  4.3.4  Look  again  at  the  scatter  plots  in  Figs.  4.2  and  4.3.  We  observe 
strong  positive  correlation  in  Fig.  4.2a  ( R  =  0.87),  strong  negative  correlation  in 
Fig.  4.2b  ( R  =  —0.92),  moderate  positive  correlation  in  Fig.  4.2c  ( R  =  0.51),  mod¬ 
erate  negative  association  in  Fig.  4. 2d  (R  =  —0.55),  no  visible  correlation  in  Fig.  4.3a 
( R  =  —0.01),  and  strong  nonlinear  correlation  in  Fig.  4.3b  ( R  =  0.99). 

Example  4.3.5  Let  us  follow  Example  4.3.3  a  bit  further  and  calculate  Spearman’s 
rank  correlation  coefficient  for  the  first  five  observations  of  the  decathlon  data.  Again 
we  list  the  results  of  the  100-m  race  (X)  and  the  results  of  the  long  jump  competition 
(T).  In  addition,  we  assign  ranks  to  both  X  and  Y.  For  example,  the  shortest  time 
receives  rank  1,  whereas  the  longest  time  receives  rank  5.  Similarly,  the  shortest  long 
jump  result  receives  rank  1,  the  longest  long  jump  result  receives  rank  5. 


i 

Xi 

R(xi) 

yi 

R(yi) 

di  df 

Roman  Sebrle 

10.85 

4 

7.84 

4 

0  0 

Bryan  Clay 

10.44 

1 

7.96 

5 

-4  16 

Dmitriy  Karpov 

10.50 

2 

7.81 

3 

-1  1 

Dean  Macey 

10.89 

5 

7.47 

1 

-4  16 

Chiel  Warners 

10.62 

3 

7.74 

2 

-1  1 

Total 

34 

Using  (4.19),  Spearman’s  rank  correlation  coefficient  can  be  calculated  as 


R  =  1  - 


6£?=i  d} 


=  1  - 


6-34 


=  -0.7. 


n(n 2  -  1)  '  5-24 

We  therefore  have  a  moderate  to  strong  negative  association  between  the  100-m 
race  and  the  long  jump  event.  We  now  know  that  for  the  5  athletes  above  longer 
running  times  relate  to  shorter  jumping  distances  which  in  turn  means  that  a  good 
performance  in  one  discipline  implies  a  good  performance  in  the  other  discipline.  In 
R ,  we  can  obtain  the  same  results  by  using  the  cor  command: 


cor (X . 100m , X . Long . j ump ,  method= ' spearman ' ) 


rr 


If  two  or  more  observations  take  the  same  values  for  Xi  (or  y;),  then  there  is  a 
tie.  In  such  situations,  the  respective  ranks  can  simply  be  averaged,  though  more 
complicated  solutions  also  exist  (one  of  which  is  implemented  in  the  R  function 
cor).  For  example,  if  in  Example  4.3.5  Bryan  Clay’s  was  10.50  s  instead  of  10.44  s, 
then  both  Bryan  Clay  and  Dmitriy  Karpov  had  the  same  time.  Instead  of  assigning 
the  ranks  1  and  2  to  them,  we  assign  the  ranks  1.5  to  each  of  them. 

The  differences  between  the  correlation  coefficient  and  the  rank  correlation  coeffi¬ 
cient  are  manifold:  firstly,  Pearson’s  correlation  coefficient  can  be  used  for  continuous 
variables  only,  but  not  for  nominal  or  ordinal  variables.  The  rank  correlation  coeffi¬ 
cient  can  be  used  for  either  two  continuous  or  two  ordinal  variables  or  a  combination 
of  an  ordinal  and  a  continuous  variable,  but  not  for  two  nominal  variables.  More¬ 
over,  the  rank  correlation  coefficient  responds  to  any  type  of  relationship  whereas 
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Pearson’s  correlation  measures  the  degree  of  a  linear  relationship  only — see  also 
Fig.  4.3b.  Another  difference  between  the  two  correlation  coefficients  is  that  Pear¬ 
son  uses  the  entire  information  contained  in  the  continuous  data  in  contrast  to  the 
rank  correlation  coefficient  which  uses  only  ordinal  information  contained  in  the 
ordered  data. 

4.3.4  Measures  Using  Discordant  and  Concordant  Pairs 

Another  concept  which  uses  ranks  to  measure  the  association  between  ordinal  vari¬ 
ables  is  based  on  concordant  and  discordant  observation  pairs.  It  is  best  illustrated 
by  means  of  an  example. 

Example  4.3.6  Suppose  an  online  book  store  conducts  a  survey  on  their  customer’s 
satisfaction  with  respect  to  both  the  timeliness  of  deliveries  (X)  and  payment  options 
(T).  Let  us  consider  the  following  2x3  contingency  table  with  a  summary  of  the 
responses  of  100  customers.  We  assume  that  the  categories  for  both  variables  can 
be  ordered  and  ranks  can  be  assigned  to  different  categories,  see  the  numbers  in 
brackets  in  Table  4.8.  There  are  100  observation  pairs  (jt*,  yt)  which  summarize  the 
response  of  the  customers  with  respect  to  both  X  and  Y.  For  example,  there  are  18 
customers  who  were  unsatisfied  with  the  timeliness  of  the  deliveries  and  complained 
that  there  are  not  enough  payment  options.  If  we  compare  two  responses  (v/, ,  y;, )  and 
(xi2 ,  y;2),  it  might  be  possible  that  one  customer  is  more  happy  (or  more  unhappy) 
than  the  other  customer  with  respect  to  both  X  and  Y  or  that  one  customer  is  more 
happy  with  respect  to  X  but  more  unhappy  with  respect  to  Y  (or  vice  versa).  If  the 
former  is  the  case,  then  this  is  a  concordant  observation  pair;  if  the  latter  is  true,  then 
it  is  a  discordant  pair.  For  instance,  a  customer  who  replied  “enough”  and  “satisfied” 
is  more  happy  than  a  customer  who  replied  “not  enough”  and  “unsatisfied”  because 
he  is  more  happy  with  respect  to  both  X  and  Y. 

In  general,  a  pair  is 

•  concordant  if  12  >  i  \  and  72  >  j\  (or  22  <  h  and  72  <  71), 

•  discordant  if  22  <  h  and  72  >  ji  (or  h  >  h  and  72  <  71), 

•  tied  if  / 1  =  h  (or  71  =  72). 


Table  4.8  Payment  options  and  timeliness  survey  with  100  participating  customers 


Timeliness 

Unsatisfied 

(1) 

Satisfied 

(2) 

Very  satisfied 

(3) 

Total 

Payment 

Not  enough  (1) 

7 

11 

26 

44 

options 

Enough  (2) 

10 

15 

31 

56 

Total 

17 

26 

57 

100 

4.3  Association  Between  Ordinal  and  Continuous  Variables 
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Obviously,  if  we  have  only  concordant  observations,  then  there  is  a  strong  posi¬ 
tive  association  because  a  higher  value  of  X  (in  terms  of  the  ranking)  implies  a 
higher  value  of  Y.  However,  if  we  have  only  discordant  observations,  then  there  is 
a  clear  negative  association.  The  measures  which  are  introduced  below  simply  put 
the  number  of  concordant  and  discordant  pairs  into  relation.  This  idea  is  reflected  in 
Goodman  and  KruskaPs  7  which  is  defined  as 


K  D  _  K  -  D 

7  ~~  K  +  D  ~  K  +  D  ~  K  +  D’ 

where 


(4.20) 


K  = 


Mi  j^mn  1 


D  = 


Mi  jftmn 


i<m  j  <n 


1 <m j >n 


describe  the  number  of  concordant  and  discordant  observation  pairs,  respectively. 
An  alternative  measure  is  Stuart’s  rc  given  as 

2min(/:,  l)(K  —  D) 

rc  =  0  - - .  (4.21) 

n2(mm(k,  /)  —  1) 

Both  measures  are  standardized  to  lie  between  —  1  and  1 ,  where  larger  values  indicate 
a  stronger  association  and  the  sign  indicates  the  direction  of  the  association. 


Example  4.3. 7  Consider  Example  4.3.6.  A  customer  who  replied  “enough”  and  “sat¬ 
isfied”  is  more  happy  than  a  customer  who  replied  “not  enough”  and  “unsatisfied” 
because  the  observation  pairs,  using  ranks,  are  (2,  2)  and  (1,  1)  and  therefore  z'2  >  h 
and  72  >  j\ .  There  are  7  x  15  such  pairs.  Similarly  those  who  said  “not  enough” 
and  “unsatisfied”  are  less  happy  than  those  who  said  “enough”  and  “very  satisfied” 
(7x31  pairs).  Table  4.5  summarizes  the  comparisons  in  detail. 

Table  4.5a  shows  that  (x\,  yi)  =  (not  enough,  unsatisfied)  is  concordant  to 
(X2,  y2)  =  (enough,  satisfied)  and  (X2,  73)  =  (enough,  very  satisfied)  and  tied  to 
(X2,  yi)  =  (enough,  unsatisfied),  (x\,  y2)  =  (not  enough,  satisfied),  and  (x\,  y$)  = 
(not  enough,  very  satisfied).  Thus  for  these  comparisons,  we  have  0  discordant  pairs, 
(7  x  15)  +  (7  x  31)  concordant  pairs  and  7  x  (10  +  11  +  26)  tied  pairs.  Table  4.5b- 
f  show  how  the  task  can  be  completed.  While  tiresome,  systematically  working 
through  the  table  (and  making  sure  to  not  count  pairs  more  than  once)  yields 

K  =  1  x  (15  +  31)  +  11  x  31  =  663 

D  =  10  x  (11  +  26)  +  15  x  26  =  760. 

As  a  visual  rule  of  thumb,  working  from  the  top  left  to  the  bottom  right  yields  the 
concordant  pairs;  and  working  from  the  bottom  left  to  the  top  right  yields  the  discor¬ 
dant  pairs.  It  follows  that  K  =  (663  —  760)/(663  +  760)  ~  —0.07  which  indicates 
no  clear  relationship  between  the  two  variables.  A  similar  result  is  obtained  using 
rc  which  is  4  x  (760  —  663)/1002  ~  0.039.  This  rather  lengthy  task  can  be  made 
much  quicker  by  using  the  ord .  gamma  and  ord .  tau  commands  from  the  R  library 
ryouready: 
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(a) 

y  i 

2/2 

2/3 

X\ 

t 

t 

X2 

t 

c 

c 

(d) 

yi 

2/2 

2/3 

X\ 

t 

d 

d 

X2 

t 

t 

(b) 

y  i 

2/2 

2/3 

X\ 

t 

t 

X2 

d 

t 

c 

(e) 

yi 

2/2 

2/3 

X\ 

c 

t 

d 

X2 

t 

t 

(c) 

2/1 

2/2 

2/3 

X\ 

t 

t 

X2 

d 

d 

t 

(f) 

2/i 

2/2 

2/3 

X\ 

c 

c 

t 

X2 

t 

t 

Fig.  4.5  Scheme  to  visualize  concordant  (c),  discordant  ( d ),  and  tied  (t)  pairs  in  a  2  x  3  contingency 
table 


library (ryouready ) 

ex  <-  matrix(c(7, 11,26, 10, 15,31) ,ncol=3,byrow=T) 

ord . gamma (ex) 

ord.tau(ex) 


4.4  Visualization  of  Variables  from  Different  Scales 

If  we  want  to  jointly  visualize  the  association  between  a  variable  X ,  which  is  either 
nominal  or  ordinal  and  another  variable  Y,  which  is  continuous,  then  we  can  use  any 
graph  which  is  suitable  for  the  continuous  variable  (see  Chaps.  2  and  3)  and  produce 
it  for  each  category  of  the  nominal/ordinal  variable.  We  recommend  using  stratified 
box  plots  or  stratified  ECDF’s,  as  they  are  easy  to  read  when  summarized  in  a  single 
figure;  however,  it  is  also  possible  to  place  histograms  next  to  each  other  or  on  top  of 
each  other,  or  overlay  kernel  density  plots,  but  we  do  not  illustrate  this  here  in  more 
detail. 

Example  4.4.1  Consider  again  our  pizza  delivery  example  (Appendix  A. 4).  If  we 
are  interested  in  the  pizza  delivery  times  by  branch,  we  may  simply  plot  the  box 
plots  and  ECDF’s  of  delivery  time  by  branch.  Figure  4.6  shows  that  the  shortest 
delivery  times  can  be  observed  in  the  branch  in  the  East.  Producing  these  graphs 
in  R  is  straightforward:  The  boxplot  command  can  be  used  for  two  variables  by 
separating  them  with  the  ~  sign.  For  the  ECDF,  we  have  to  produce  a  plot  for  each 
branch  and  overlay  them  with  the  “add=TRUE”  option. 

boxplot  (time^branch) 

plot . ecdf (time [branch== ' East '  ]  ) 

plot . ecdf (time  [branch== ' West ' ] ,  add=TRUE) 

plot . ecdf (time  [branch== ' Centre ' ] ,  add=TRUE) 


Delivery  Time  (in  min) 
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o 

LO 


O 


o 

00 


o 

CM 


Centre 


East 

Branch 


West 


(a)  Using  box  plots 

Fig.  4.6  Distribution  of  pizza  delivery  time  stratified  by  branch 


(b)  Using  ECDF’s 


4.5  Key  Points  and  Further  Issues 


Note: 


/  How  to  use  different  measures  of  association: 

2  nominal  variables  ->  Pearson’s  x* 2,  relative  risks,  odds  ratio, 

Cramer’s  V,  and  CCOrr 

2  ordinal  variables  ->  Spearman’s  rank  correlation  coeffi¬ 
cient,  7,  rc 

2  continuous  variables  — >  Pearson’s  correlation  coefficient, 

Spearman’s  correlation  coefficient 

/  For  two  variables  which  are  measured  on  different  scales,  for  exam¬ 
ple  continuous/ordinal  or  ordinal/nominal,  one  should  use  measures  of 
association  suitable  for  the  less  informative  of  the  two  scales. 

/  Another  graphical  representation  of  both  a  continuous  and  discrete  vari¬ 
able  is  stratified  confidence  interval  plots  (error  plots),  see  Chap.  9. 
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4.6  Exercises 

Exercise  4.1  A  newspaper  asks  two  of  its  staff  to  review  the  coffee  quality  at  different 
trendy  cafes.  The  coffee  can  be  rated  on  a  scale  from  1  (miserable)  to  10  (excellent). 
The  results  of  the  two  coffee  enthusiasts  X  and  Y  are  as  follows: 


Cafe  i 

xi  yi 

1 

3  6 

2 

8  7 

3 

7  10 

4 

9  8 

5 

5  4 

(a)  Calculate  and  interpret  Spearman’s  rank  correlation  coefficient. 

(b)  Does  Spearman’s  R  differ  depending  on  whether  ranks  are  assigned  in  a  decreas¬ 
ing  or  increasing  order? 

(c)  Suppose  the  coffee  can  only  be  rated  as  either  good  (>5)  or  bad  (<5).  Do  the 
chances  of  a  good  rating  differ  between  the  two  journalists? 

Exercise  4.2  A  total  of  150  customers  of  a  petrol  station  are  asked  about  their  satis¬ 
faction  with  their  car  and  motorbike  insurance.  The  results  are  summarized  below: 


Satisfied  Unsatisfied 

Total 

Car 

33 

25 

58 

Car  (diesel  engine) 

29 

31 

60 

Motorbike 

12 

20 

32 

Total 

74 

76 

150 

(a)  Determine  and  interpret  Pearson’s  \2  statistic,  Cramer’s  V,  and  CCOrr- 

(b)  Combine  the  categories  “car”  and  “car  (diesel  engine)”  and  produce  the  corre¬ 
sponding  2x2  table.  Calculate  x2  as  efficiently  as  possible  and  give  a  mean¬ 
ingful  interpretation  of  the  odds  ratio. 

(c)  Compare  the  results  from  (a)  and  (b). 

Exercise  4.3  There  has  been  a  big  debate  about  the  usefulness  of  speed  limits  on 
public  roads.  Consider  the  following  table  which  lists  the  speed  limits  for  country 
roads  (in  miles/h)  and  traffic  deaths  (per  100  million  km)  for  different  countries  in 
1986  when  the  debate  was  particularly  serious: 

(a)  Draw  the  scatter  plot  for  the  two  variables. 

(b)  Calculate  the  Bravais-Pearson  and  Spearman  correlation  coefficients. 
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Country 

Speed  limit  Traffic  deaths 

Denmark 

55 

4.1 

Japan 

55 

4.7 

Canada 

60 

4.3 

Netherlands 

60 

5.1 

Italy 

75 

6.1 

(c)  What  are  the  effects  on  the  correlation  coefficients  if  the  speed  limit  is  given  in 
km/h  rather  than  miles/h  (1  mile/h  ~  1 .61  km/h)? 

(d)  Consider  one  more  observation:  the  speed  limit  for  England  was  70  miles/h  and 
the  death  rate  was  3.1. 

(i)  Add  this  observation  to  the  scatter  plot. 

(ii)  Calculate  the  Bravais-Pearson  correlation  coefficient  given  this  additional 
observation. 

Exercise  4.4  The  famous  passenger  liner  Titanic  hit  an  iceberg  in  1912  and  sank.  A 
total  of  337  passengers  travelled  in  first  class,  285  in  second  class,  and  721  in  third 
class.  In  addition,  there  were  885  staff  members  on  board.  Not  all  passengers  could 
be  rescued.  Only  the  following  were  rescued:  135  from  the  first  class,  160  from  the 
second  class,  541  from  the  third  class  and  674  staff. 

(a)  Determine  and  interpret  the  contingency  table  for  the  variables  “travel  class” 
and  “rescue  status”. 

(b)  Use  a  contingency  table  to  summarize  the  conditional  relative  frequency  distri¬ 
butions  of  rescue  status  given  travel  class.  Could  there  be  an  association  of  the 
two  variables? 

(c)  What  would  the  contingency  table  from  (a)  look  like  under  the  independence 
assumption?  Calculate  Cramer’s  V  statistic.  Is  there  any  association  between 
travel  class  and  rescue  status? 

(d)  Combine  the  categories  “first  class”  and  “second  class”  as  well  as  “third  class” 
and  “staff”.  Create  a  contingency  table  based  on  these  new  categories.  Determine 
and  interpret  Cramer’s  V,  the  odds  ratio,  and  relative  risks  of  your  choice. 

(e)  Given  the  results  from  (a)  to  (d),  what  are  your  conclusions? 

Exercise  4.5  To  study  the  association  of  the  monthly  average  temperature  (in  °C, 
X)  and  hotel  occupation  (in  %,  T),  we  consider  data  from  three  cities:  Polenca 
(Mallorca,  Spain)  as  a  summer  holiday  destination,  Davos  (Switzerland)  as  a  winter 
skiing  destination,  and  Basel  (Switzerland)  as  a  business  destination. 
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Fig.  4.7  Temperature  and  hotel  occupancy  for  the  different  cities 


Months 

Davos 

Polenca 

Basel 

X 

Y 

X 

Y 

X 

Y 

Jan 

—6 

91 

10 

13 

1 

23 

Feb 

-5 

89 

10 

21 

0 

82 

Mar 

2 

76 

14 

42 

5 

40 

Apr 

4 

52 

17 

64 

9 

45 

May 

7 

42 

22 

79 

14 

39 

Jun 

15 

36 

24 

81 

20 

43 

Jul 

17 

37 

26 

86 

23 

50 

Aug 

19 

39 

27 

92 

24 

95 

Sep 

13 

26 

22 

36 

21 

64 

Oct 

9 

27 

19 

23 

14 

78 

Nov 

4 

68 

14 

13 

9 

9 

Dec 

0 

92 

12 

41 

4 

12 

(a)  Calculate  the  Bravais-Pearson  correlation  coefficient.  The  following  summary 
statistics  are  available:  X/=i  xiYi  =  22,  776,  x  =  12.22,  y  =  51.28,  ^  =  76.95, 
and  Sy  =  706.98. 

(b)  Interpret  the  scatter  plot  in  Fig.  4.7  which  visualizes  temperature  and  hotel  occu¬ 
pancy  for  Davos  (D),  Polenca  (P),  and  Basel  (B). 

(c)  Use  R  to  calculate  the  correlation  coefficient  separately  for  each  city.  Interpret 
the  results  and  discuss  the  use  of  the  correlation  coefficient  if  more  than  two 
variables  are  available. 
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Exercise  4.6  Consider  a  neighbourhood  survey  on  the  use  of  a  local  park.  Respon¬ 
dents  were  asked  whether  the  park  may  be  used  for  summer  music  concerts  and 
whether  dog  owners  should  put  their  dogs  on  a  lead.  The  results  are  summarized  in 
the  following  contingency  table: 


Put  dogs  on  a  lead 

Agree  No  opinion  Disagree 

Total 

Use  for  concerts  Agree 

82 

4 

0 

86 

No  opinion 

8 

43 

9 

60 

Disagree 

0 

2 

10 

12 

Total 

90 

49 

19 

158 

(a)  Calculate  and  interpret  Goodman  and  Kruskal’s  7. 

(b)  Now  ignore  the  ordinal  structure  of  the  data  and  calculate  Cramer’s  V. 

(c)  Create  the  contingency  table  which  is  obtained  when  the  categories  “no  opinion” 
and  “agree”  are  combined. 

(d)  What  is  the  relative  risk  of  disagreement  with  summer  concerts  depending  on 
the  opinion  about  using  leads? 

(e)  Calculate  the  odds  ratio  and  offer  two  interpretations  of  it. 

(f)  Determine  7  for  the  table  calculated  in  (c). 

(g)  What  is  your  final  interpretation  and  what  may  be  the  best  measure  to  use  in  this 
example? 

Exercise  4.7  Consider  n  observations  for  which  yi  =  a  +  bxi ,  b  >  0,  holds.  Show 

that  r  —  1 . 


Exercise  4.8  Make  yourself  familiar  with  the  Olympic  decathlon  data  described  in 

Appendix  A. 4.  Read  in  and  attach  the  data  in  R. 

(a)  Use  R  to  calculate  and  interpret  the  Bravais-Pearson  correlation  coefficient 
between  the  results  of  the  discus  and  the  shot-put  events. 

(b)  There  are  10  continuous  variables.  How  many  different  correlation  coefficients 
can  you  calculate?  How  would  you  summarize  them? 

(c)  Apply  the  cor  command  to  the  whole  data  and  interpret  the  output. 

(d)  Omit  the  two  rows  which  contain  missing  data  and  interpret  the  output  again. 

Exercise  4.9  We  are  interested  in  the  pizza  delivery  data  which  is  described  in 

Appendix  A.4. 

(a)  Read  in  the  data  and  create  two  new  binary  variables  which  describe  whether 
a  pizza  was  hot  (>65  °C)  and  the  delivery  time  was  short  (<30min).  Create  a 
contingency  table  for  the  two  new  variables. 

(b)  Calculate  and  interpret  the  odds  ratio  for  the  contingency  table  from  (a). 
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(c)  Use  Cramer’s  V,  Stuart’s  rc,  Goodman  and  Kruskal’s  7,  and  a  stacked  bar  chart  to 
explore  the  association  between  the  categorical  time  and  temperature  variables. 

(d)  Draw  a  scatter  plot  for  the  continuous  time  and  temperature  variables.  Determine 
both  the  Bravais-Pearson  and  Spearman  correlation  coefficients. 

(e)  Use  methods  of  your  choice  to  explore  the  relationship  between  temperature 
and  driver,  operator,  number  of  ordered  pizzas  and  bill.  Is  it  clear  which  of  the 
variables  influence  the  pizza  temperature? 


Solutions  to  all  exercises  in  this  chapter  can  be  found  on  p.  345 


Part 


Probability  Calculus 


Combinatorics 


5.1  Introduction 

Combinatorics  is  a  special  branch  of  mathematics.  It  has  many  applications  not 
only  in  several  interesting  fields  such  as  enumerative  combinatorics  (the  classical 
application),  but  also  in  other  fields,  for  example  in  graph  theory  and  optimization. 

First,  we  try  to  motivate  and  understand  the  role  of  combinatorics  in  statistics. 
Consider  a  simple  example  in  which  someone  goes  to  a  cafe.  The  person  would 
like  a  hot  beverage  and  a  cake.  Assume  that  one  can  choose  among  three  different 
beverages,  for  example  cappuccino,  hot  chocolate,  and  green  tea,  and  three  different 
cakes,  let  us  say  carrot  cake,  chocolate  cake,  and  lemon  tart.  The  person  may  consider 
different  beverage  and  cake  combinations  when  placing  the  order,  for  example  carrot 
cake  and  cappuccino,  carrot  cake  and  tea,  and  hot  chocolate  and  lemon  tart.  From  a 
statistical  perspective,  the  customer  is  evaluating  the  possible  combinations  before 
making  a  decision.  Depending  on  their  preferences,  the  order  will  be  placed  by 
choosing  one  of  the  combinations. 

In  this  example,  it  is  easy  to  calculate  the  number  of  possible  combinations. 
There  are  three  different  beverages  and  three  different  cakes  to  choose  from,  leading 
to  nine  different  (3  x  3)  beverage  and  cake  combinations.  However,  suppose  there 
is  a  choice  of  15  hot  beverages  and  8  different  cakes.  How  many  orders  can  be 
made?  (Answer:  15x8)  What  if  the  person  decides  to  order  two  cakes,  how  will  it 
affect  the  number  of  possible  combinations  of  choices?  It  will  be  a  tedious  task  to 
count  all  the  possibilities.  So  we  need  a  systematic  approach  to  count  such  possible 
combinations.  Combinatorics  deals  with  the  counting  of  different  possibilities  in  a 
systematic  approach. 

People  often  use  the  urn  model  to  understand  the  system  in  the  counting  process. 
The  urn  model  deals  with  the  drawing  of  balls  from  an  urn.  The  balls  in  the  urn 
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(a)  (b)  (c) 


m  MJ-Ut-JM 

Fig.  5.1  a  Representation  of  the  urn  model.  Drawing  from  the  urn  model  b  with  replacement  and  c 
without  replacement.  Compositions  of  three  drawn  balls:  d  all  balls  are  distinguishable  and  e  some 
balls  are  not  distinguishable 

represent  the  units  of  a  population,  or  the  features  of  a  population.  The  balls  may 
vary  in  colour  or  size  to  represent  specific  properties  of  a  unit  or  feature.  We  illustrate 
this  concept  in  more  detail  in  Fig.  5.1. 

Suppose  there  are  5  balls  of  three  different  colours — two  black,  one  grey,  and  two 
white  (see  Fig.  5.1a).  This  can  be  generalized  to  a  situation  in  which  there  are  n  balls 
in  the  urn  and  we  want  to  draw  m  balls.  Suppose  we  want  to  know 

•  how  many  different  possibilities  exist  to  draw  m  out  of  n  balls  (thus  determining 
the  number  of  distinguishable  combinations). 

To  deal  with  such  a  question,  we  first  need  to  decide  whether  a  ball  will  be  put  back 
into  the  urn  after  it  is  drawn  or  not.  Figure 5. lb  illustrates  that  a  grey  ball  is  drawn 
from  the  urn  and  then  placed  back  (illustrated  by  the  two-headed  arrow).  We  say  the 
ball  is  drawn  with  replacement.  Figure 5.1c  illustrates  a  different  situation  in  which 
the  grey  ball  is  drawn  from  the  urn  and  is  not  placed  back  into  the  urn  (illustrated  by 
the  one-headed  arrow).  We  say  the  ball  is  drawn  without  replacement. 

Further,  we  may  be  interested  in  knowing  the 

•  total  number  of  ways  in  which  the  chosen  set  of  balls  can  be  arranged  in  a  distin¬ 
guishable  order  (which  we  will  define  as  permutations  later  in  this  chapter). 

To  answer  the  question  how  many  permutations  exist,  we  first  need  to  decide  whether 
all  the  chosen  balls  are  distinguishable  from  each  other  or  not.  For  example,  in 
Fig.  5. Id,  the  three  chosen  balls  have  different  colours;  therefore,  they  are  distin¬ 
guishable.  There  are  many  options  on  how  they  can  be  arranged.  In  contrast,  some 
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of  the  chosen  balls  in  Fig.5.1e  are  the  same  colour,  they  are  therefore  not  distin¬ 
guishable.  Consequently,  the  number  of  combinations  is  much  more  limited.  The 
concept  of  balls  and  urns  just  represents  the  features  of  observations  from  a  sample. 
We  illustrate  this  in  more  detail  in  the  following  example. 

Example  5.1.1  Say  a  father  promises  his  daughter  three  scoops  of  ice  cream  if  she 
cleans  up  her  room.  For  simplicity,  let  us  assume  the  daughter  has  a  choice  of  four 
flavours:  chocolate,  banana,  cherry,  and  lemon.  How  many  different  choices  does  the 
daughter  have?  If  each  scoop  has  to  be  a  different  flavour  she  obviously  has  much 
less  choice  than  if  the  scoops  can  have  the  same  flavour.  In  the  urn  model,  this  is  rep¬ 
resented  by  the  concept  of  “with/without  replacement”.  The  urn  contains  4  balls  of  4 
different  colours  which  represent  the  ice  cream  flavours.  For  each  of  the  three  scoops, 
a  ball  is  drawn  to  determine  the  flavour.  If  we  draw  with  replacement,  each  flavour 
can  be  potentially  chosen  multiple  times;  however,  if  we  draw  without  replacement 
each  flavour  can  be  chosen  only  once.  Then,  the  number  of  possible  combinations 
is  easy  to  calculate:  it  is  4,  i.e.  (chocolate,  banana,  and  cherry);  (chocolate,  banana, 
and  lemon);  (chocolate,  cherry,  and  lemon);  and  (banana,  cherry,  and  lemon).  But 
what  if  we  have  more  choices?  Or  if  we  can  draw  flavours  multiple  times?  We  then 
need  calculation  rules  which  help  us  counting  the  number  of  options. 

Now,  let  us  assume  that  the  daughter  picked  the  flavours  (chocolate  [C],  banana 
[B],  and  lemon  [L]).  Like  many  other  children,  she  prefers  to  eat  her  most  favourite 
flavour  (chocolate)  last,  and  her  least  favourite  flavour  (cherry)  first.  Therefore,  the 
order  in  which  the  scoops  are  placed  on  top  of  the  cone  are  important!  In  how 
many  different  ways  can  the  scoops  be  placed  on  top  of  the  cone?  This  relates 
to  the  question  of  the  number  of  distinguishable  permutations.  The  answer  is  6: 
(C,B,L)-(C,L,B)-(B,L,C)-(B,C,L)-(L,B,C)-(L,C,B).  But  what  if  the  daughter  did 
pick  a  flavour  multiple  times,  e.g.  (chocolate,  chocolate,  lemon)?  Since  the  two 
chocolate  scoops  are  non-distinguishable,  there  are  fewer  permutations:  (chocolate, 
chocolate,  and  lemon)-(chocolate,  lemon,  and  chocolate)-(lemon,  chocolate,  and 
chocolate). 

The  bottom  line  of  this  example  is  that  the  number  of  combinations/options  is 
determined  by  (i)  whether  we  draw  with  or  without  replacement  (i.e.  allow  flavours 
to  be  chosen  more  than  once)  and  (ii)  whether  the  arrangement  in  a  particular  order 
(^permutation)  is  of  any  specific  interest. 

Consider  the  urn  example  again.  Suppose  three  balls  of  different  colours,  black, 
grey,  and  white,  are  drawn.  Now  there  are  two  options:  The  first  option  is  to  take  into 
account  the  order  in  which  the  balls  are  drawn.  In  such  a  situation,  two  possible  sets 
of  balls  such  as  (black,  grey,  and  white)  and  (white,  black,  and  grey)  constitute  two 
different  sets.  Such  a  set  is  called  an  ordered  set.  In  the  second  option,  we  do  not 
take  into  account  the  order  in  which  the  balls  are  drawn.  In  such  a  situation,  the  two 
possible  sets  of  balls  such  as  (black,  grey,  and  white)  and  (white,  black,  and  grey) 
are  the  same  sets  and  constitute  an  unordered  set  of  balls. 
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Definition  5.1.1  A  group  of  elements  is  said  to  be  ordered  if  the  order  in  which 

these  elements  are  drawn  is  of  relevance.  Otherwise,  it  is  called  unordered. 

Examples. 

•  Ordered  samples: 

-  The  first  three  places  in  an  Olympic  100  m  race  are  determined  by  the  order 
in  which  the  athletes  arrive  at  the  finishing  line.  If  8  athletes  are  competing 
with  each  other,  the  number  of  possible  results  for  the  first  three  places  is  of 
interest.  In  the  urn  language,  we  are  taking  draws  without  replacement  (since 
every  athlete  can  only  have  one  distinct  place). 

-  In  a  raffle  with  two  prizes,  the  first  drawn  raffle  ticket  gets  the  first  prize  and  the 
second  raffle  ticket  gets  the  second  prize. 

-  There  exist  various  esoteric  tarot  card  games  which  claim  to  foretell  someone’s 
fortune  with  respect  to  several  aspects  of  life.  The  order  in  which  the  cards  are 
shown  on  the  table  is  important  for  the  interpretation. 

•  Unordered  samples: 

-  The  selected  members  for  a  national  football  team.  The  order  in  which  the 
selected  names  are  announced  is  irrelevant. 

-  Out  of  10  economists,  10  medical  doctors,  and  10  statisticians,  an  advisory 
committee  consisting  of  4  economists,  3  medical  doctors,  and  2  statisticians  is 
elected. 

-  Fishing  20  fish  from  a  lake. 

-  A  bunch  of  10  flowers  made  from  21  flowers  of  4  different  colours. 


Definition  5.1.2  The  factorial  function  nl  is  defined  as 


f  = 


ni  = 


1  for  n  =  0 

1-2-3  •  •  -  n  for  n  >  0. 


(5.1) 


Example  5.1.2  It  follows  from  the  definition  of  the  factorial  function  that 

0!  =  1,  1!  =  1  21  =  1-2  =  2,  31  =  1-2-3  =  6. 


This  can  be  calculated  in  R  as  follows: 
factorial (n) 
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5.2  Permutations 

Definition  5.2.1  Consider  a  set  of  n  elements.  Each  ordered  composition  of  these  n 
elements  is  called  a  permutation. 

We  distinguish  between  two  cases:  If  all  the  elements  are  distinguishable,  then  we 
speak  of  permutation  without  replacement .  However,  if  some  or  all  of  the  elements 
are  not  distinguishable,  then  we  speak  of  permutation  with  replacement .  Please  note 
that  the  meaning  of  “replacement”  here  is  just  a  convention  and  does  not  directly 
refer  to  the  drawings,  e.g.  from  the  urn  model  considered  in  Example  5.1.1. 


5.2.1  Permutations  without  Replacement 

If  all  the  n  elements  are  distinguishable,  then  there  are 

n\ 

different  compositions  of  these  elements. 


(5.2) 


Example  5.2.1  There  were  three  candidate  cities  for  hosting  the  2020  Olympic 
Games:  Tokyo  (T),  Istanbul  (I),  and  Madrid  (M).  Before  the  election,  there  were 
3!  =  6  possible  outcomes,  regarding  the  final  rankings  of  the  cities: 

(M,  T,  I),  (M,  I,  T),  (T,  M,  I),  (T,  I,  M),  (I,M,T),  (I,T,M). 


5.2.2  Permutations  with  Replacement 

Assume  that  not  all  n  elements  are  distinguishable.  The  elements  are  divided  into 
groups,  and  these  groups  are  distinguishable.  Suppose,  there  are  s  groups  of  sizes 
n\,n2,  . . . ,  ns.  The  total  number  of  different  ways  to  arrange  the  n  elements  in  s 
groups  is: 

n\ 

-  .  (5.3) 

n\\  n 2!  n 3!  •  •  •  ns\ 

Example  5.2.2  Consider  the  data  in  Fig.  5.  le.  There  are  two  groups  consisting  of  two 
black  balls  (n\  =2)  and  one  white  ball  (112  =  1).  So  there  are  the  following  three 
possible  combinations  to  arrange  the  balls:  (black,  black,  and  white),  (black,  white, 
and  black),  and  (white,  black,  and  black).  This  can  be  determined  by  calculating 


3! 


3-2-1 
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5.3  Combinations 

Definition  5.3.1  The  Binomial  coefficient  for  any  integers  m  and  n  with  n  >  m  >  0 
is  denoted  and  defined  as 

( 

It  is  read  as  “n  choose  m”  and  can  be  calculated  in  R  using  the  following  command: 
choose (n,m) 


) 


n\ 


m !  (n  —  m)\ 


(5.4) 


There  are  several  calculation  rules  for  the  binomial  coefficient: 

©*‘  C)-‘  ('•)■(-:-)■  “ 

We  now  answer  the  question  of  how  many  different  possibilities  exist  to  draw  m 
out  of  n  elements,  i.e.  m  out  of  n  balls  from  an  urn.  It  is  necessary  to  distinguish 
between  the  following  four  cases: 


(1)  Combinations  without  replacement  and  without  consideration  of  the  order  of 
the  elements. 

(2)  Combinations  without  replacement  and  with  consideration  of  the  order  of  the 
elements. 

(3)  Combinations  with  replacement  and  without  consideration  of  the  order  of  the 
elements. 

(4)  Combinations  with  replacement  and  with  consideration  of  the  order  of  the  ele¬ 
ments. 


5.3.1  Combinations  without  Replacement  and  without 
Consideration  of  the  Order 


When  there  is  no  replacement  and  the  order  of  the  elements  is  also  not  relevant,  then 
the  total  number  of  distinguishable  combinations  in  drawing  m  out  of  n  elements  is 


(5.6) 


Example  5.3.1  Suppose  a  company  elects  a  new  board  of  directors.  The  board  con¬ 
sists  of  5  members  and  15  people  are  eligible  to  be  elected.  How  many  combinations 
for  the  board  of  directors  exist?  Since  a  person  cannot  be  elected  twice,  we  have  a 
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situation  where  there  is  no  replacement.  The  order  is  also  of  no  importance:  either 
one  is  elected  or  not.  We  can  thus  apply  (5.6)  which  yields 


=  3003 


possible  combinations.  This  result  can  be  obtained  in  R  by  using  the  command 
choose (15 , 5) . 


5.3.2  Combinations  without  Replacement  and  with  Consideration 
of  the  Order 

The  total  number  of  different  combinations  for  the  setting  without  replacement  and 
with  consideration  of  the  order  is 

n\  ( n\ 

-  =  (  Ira!  .  (5.7) 

(n  —  m)\  \rn ) 

Example  5.3.2  Consider  a  horse  race  with  12  horses.  A  possible  bet  is  to  forecast 
the  winner  of  the  race,  the  second  horse  of  the  race,  and  the  third  horse  of  the  race. 
The  total  number  of  different  combinations  for  the  horses  in  the  first  three  places  is 

121 

- - —  =  12  •  11  •  10  =  1320 . 

(12-3)! 

This  result  can  be  explained  intuitively:  for  the  first  place,  there  is  a  choice  of  12 
different  horses.  For  the  second  place,  there  is  a  choice  of  11  different  horses  (12 
horses  minus  the  winner).  For  the  third  place,  there  is  a  choice  of  10  different  horses 
(12  horses  minus  the  first  and  second  horses).  The  total  number  of  combinations  is 
the  product  12  •  11  •  10.  This  can  be  calculated  in  R  as  follows: 

12*11*10  U 


5.3.3  Combinations  with  Replacement  and  without  Consideration 
of  the  Order 

The  total  number  of  different  combinations  with  replacement  and  without  consider¬ 
ation  of  the  order  is 

n  +  ra  —  1  \  (n  +  m  —  1)!  /n+m  —  1 

ra  /  m\  (n  —  1)!  \  n  —  1 

Note  that  these  are  the  two  representations  which  follow  from  the  definition  of  the 
binomial  coefficient  but  typically  only  the  first  representation  is  used  in  textbooks. 
We  will  motivate  the  second  representation  after  Example  5.3.3. 


) 


(5.8) 
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Example  5.3.3  A  farmer  has  2  fields  and  aspires  to  cultivate  one  out  of  4  different 
organic  products  per  field.  Then,  the  total  number  of  choices  he  has  is 

4  +  2-  1\  /5\  5!  3!  -4-5 

)  =  (  )  =  - = - =  10.  (5.9) 

2  )  \2J  2!  3!  1-2-3! 

If  4  different  organic  products  are  denoted  as  a,  b,  c,  and  d,  then  the  following 
combinations  are  possible: 

(a,  a)  (a,  b)  (a,  c)  (a,  d) 

(b,  b)  (b,  c)  (b,  d) 

(c,  c)  (c,  d) 

(d,  d) 

Please  note  that,  for  example,  (a,b)  is  identical  to  (b,a)  because  the  order  in  which 
the  products  a  and  b  are  cultivated  on  the  first  or  second  field  is  not  important  in  this 
example. 

We  now  try  to  give  an  intuitive  explanation  of  formula  (5.9)  using  Example  5.3.3. 
We  have  n  =  4  products  and  m  =  2  fields  and  apply  the  following  technical  “trick”: 
we  sort  the  combinations  by  the  product  symbols  (a,  b,  c,  or  d).  When  we  switch  from 
one  product  to  the  next  (e.g.  from  b  to  c),  we  make  a  note  by  adding  a  vertical  line 
| .  Whenever  a  product  is  skipped,  we  add  a  line  too.  For  example,  the  combination 
(a,  c )  is  denoted  by  a\\c\,  the  combination  (d,  d)  by  1 1 1 dd,  ( c ,  c )  by  | \cc\,  and  ( a ,  a) 
by  aa  1 1 1 .  Therefore,  the  number  of  characters  equates  to  the  2  chosen  symbols  of  the 
set  (a,  b,  c,  d)  plus  the  3  vertical  lines,  in  summary  (4  +  2)  —  1  =  5  places  where 
3  =  ft  —  1  places  are  selected  for  the  vertical  line  | .  How  many  different  line/letter 
combinations  exist?  There  are  3  out  of  5  possible  positions  for  | ,  i.e.  (3)  =  10  possible 
combinations,  and  this  is  nothing  but  the  right-hand  side  of  (5.9). 


5.3.4  Combinations  with  Replacement  and  with  Consideration  of 
the  Order 

The  total  number  of  different  combinations  for  the  integers  m  and  n  with  replacement 
and  when  the  order  is  of  relevance  is 


nm  .  (5.10) 

Example  5.3.4  Consider  a  credit  card  with  a  four-digit  personal  identification  num¬ 
ber  (PIN)  code.  The  total  number  of  possible  combinations  for  the  PIN  is 

nm  =  104  =  10,  000. 

Note  that  every  digit  in  the  first,  second,  third,  and  fourth  places  (m  =  4)  can  be 
chosen  out  of  ten  digits  from  0  to  9  (ft  =  10). 
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5.4  Key  Points  and  Further  Issues 


Note: 


/  The  rules  of  combinatorics  are  as  follows: 


Combinations 

without  replacement  with  replacement 

without  order 

1 

Km) 

i 

/n  +  m  —  1\ 

\  m  ) 

with  order 

1 

{ n \ 
Km) 

|  ml 

nm 

/  Combinations  with  and  without  replacement  are  also  often  called  com¬ 
binations  with  and  without  repetition. 

/  The  permutation  rules  are  as  follows: 


without  replacement  with  replacement 

Permutations 

n\ 

nl  - 

n\  \  •  •  •  ns\ 

5.5  Exercises 

Exercise  5.1  At  a  party  with  10  guests,  every  guest  shakes  hands  with  each  other 
guest.  How  many  handshakes  can  be  counted  in  total? 

Exercise  5.2  A  language  teacher  is  concerned  about  the  vocabularies  of  his  students. 
He  thus  tests  5  students  in  each  lecture.  What  are  the  total  number  of  possible 
combinations 

(a)  if  a  student  is  tested  only  once  per  lecture  and 

(b)  if  a  student  is  tested  more  than  once  per  lecture? 

Use  R  to  quantify  numbers  which  you  cannot  calculate  manually. 

Exercise  5.3  “Gobang”  is  a  popular  game  in  which  two  players  set  counters  on  a 
board  with  381  knots.  One  needs  to  place  5  consecutive  counters  in  a  row  to  win 
the  game.  There  are  also  rules  on  how  to  remove  counters  from  the  other  player. 
Consider  a  match  where  64  counters  have  already  been  placed  on  the  board.  How 
many  possible  combinations  exist  to  place  64  counters  on  the  board? 
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Exercise  5.4  A  shop  offers  a  special  tray  of  beer:  “Munich’s  favourites”.  Customers 


are  allowed  to  fill  the  tray,  which  holds  20  bottles,  with  any  combination  of  Munich’s 
6  most  popular  beers  (from  6  different  breweries). 

(a)  What  are  the  number  of  possible  combinations  to  fill  the  tray? 

(b)  A  customer  insists  of  having  at  least  one  beer  from  each  brewery  in  his  tray. 
How  many  options  does  he  have  to  fill  the  tray? 

Exercise  5.5  The  FIFA  World  Cup  2018  in  Russia  consists  of  32  teams.  How  many 
combinations  for  the  top  3  teams  exist  when 

(a)  taking  into  account  the  order  of  these  top  3  teams  and 

(b)  without  taking  into  account  the  order  of  these  top  3  teams? 

Exercise  5.6  An  online  book  store  assigns  membership  codes  to  each  member.  For 
administrative  reasons,  these  codes  consist  of  four  letters  between  “A”  and  “L”. 
A  special  discount  period  increased  the  total  number  of  members  from  18,  200  to 
20,  500.  Are  there  enough  combinations  of  codes  left  to  be  assigned  for  the  new 
membership  codes? 

Exercise  5. 7  In  the  old  scoring  system  of  ice  skating  (valid  until  2004),  each  member 
of  a  jury  of  9  people  judged  the  performance  of  the  skaters  on  a  scale  between  0  and 
6.  It  was  a  decimal  scale  and  thus  scores  such  as  5.1  and  5.2  were  possible.  Calculate 
the  number  of  possible  score  combinations  from  the  jury. 

Exercise  5.8  It  is  possible  in  Pascal’s  triangle  (Fig.  5.2,  left)  to  view  each  entry  as 
the  sum  of  the  two  entries  directly  above  it.  For  example,  the  3  on  the  fourth  line 


1  5  10  10  5  1 


1  4  6  4  1 


13  3  1 


1  2  1 


1  1 


1 


Fig.  5.2  Excerpt  from  Pascal’s  triangle  (left)  and  its  representation  by  means  of  binomial  coeffi¬ 
cients  (right) 
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from  the  top  is  the  sum  of  the  1  and  2  above  the  3.  Another  interpretation  refers 
to  a  geometric  representation  of  the  binomial  coefficient,  (^)  (Fig.  5.2,  right)  with 
k  =  0,  1,  2,  . . .  being  the  column  index  and  n  =  0,  1,  2,  . . .  being  the  row  index. 

(a)  Show  that  each  entry  in  the  bold  third  diagonal  line  can  be  represented  via  (") . 

(b)  Now  show  that  the  sum  of  two  consecutive  entries  in  the  bold  third  diagonal  line 
always  corresponds  to  quadratic  numbers. 


Solutions  to  all  exercises  in  this  chapter  can  be  found  on  p.  358 


Elements  of  Probability  Theory 


Let  us  first  consider  some  simple  examples  to  understand  the  need  for  probability 
theory.  Often  one  needs  to  make  a  decision  whether  to  carry  an  umbrella  or  not  when 
leaving  the  house;  a  company  might  wonder  whether  to  introduce  a  new  advertise¬ 
ment  to  possibly  increase  sales  or  to  continue  with  their  current  advertisement;  or 
someone  may  want  to  choose  a  restaurant  based  on  where  he  can  get  his  favourite 
dish.  In  all  these  situations,  randomness  is  involved.  For  example,  the  decision  of 
whether  to  carry  an  umbrella  or  not  is  based  on  the  possibility  or  chance  of  rain. 
The  sales  of  the  company  may  increase,  decrease,  or  remain  unchanged  with  a  new 
advertisement.  The  investment  in  a  new  advertising  campaign  may  therefore  only 
be  useful  if  the  probability  of  its  success  is  higher  than  that  of  the  current  adver¬ 
tisement.  Similarly,  one  may  choose  the  restaurant  where  one  is  most  confident  of 
getting  the  food  of  one’s  choice.  In  all  such  cases,  an  event  may  be  happening  or  not 
and  depending  on  its  likelihood,  actions  are  taken.  The  purpose  of  this  chapter  is  to 
learn  how  to  calculate  such  likelihoods  of  events  happening  and  not  happening. 


6.1  Basic  Concepts  and  Set  Theory 

A  simple  (not  rigorous)  definition  of  a  random  experiment  requires  that  the  exper¬ 
iment  can  be  repeated  any  number  of  times  under  the  same  set  of  conditions,  and 
its  outcome  is  known  only  after  the  completion  of  the  experiment.  A  simple  and 
classical  example  of  a  random  experiment  is  the  tossing  of  a  coin  or  the  rolling  of  a 
die.  When  tossing  a  coin,  it  is  unknown  what  the  outcome  will  be,  head  or  tail,  until 
the  coin  is  tossed.  The  experiment  can  be  repeated  and  different  outcomes  may  be 
observed  in  each  repetition.  Similarly,  when  rolling  a  die,  it  is  unknown  how  many 
dots  will  appear  on  the  upper  surface  until  the  die  is  rolled.  Again,  the  die  can  be 
rolled  repeatedly  and  different  numbers  of  dots  are  obtained  in  each  trial.  A  possible 
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outcome  of  a  random  experiment  is  called  a  simple  event  (or  elementary  event) 
and  denoted  by  cc/.  The  set  of  all  possible  outcomes,  {uq,  uq, . . . ,  uq},  is  called  the 
sample  space  and  is  denoted  as  Q,  i.e.  Q  =  {uo i,  cj 2, . . . ,  uq}.  Subsets  of  Q  are 
called  events  and  are  denoted  by  capital  letters  such  as  A,  B,  C.  The  set  of  all  simple 
events  that  are  contained  in  the  event  A  is  denoted  by  The  event  A  refers  to 
the  non-occurring  of  A  and  is  called  a  composite  or  complementary  event.  Also 
Q  is  an  event.  Since  it  contains  all  possible  outcomes,  we  say  that  Q  will  always 
occur  and  we  call  it  a  sure  event  or  certain  event.  On  the  other  hand,  if  we  consider 
the  null  set  0  =  {}  as  an  event,  then  this  event  can  never  occur  and  we  call  it  an 
impossible  event.  The  sure  event  therefore  is  the  set  of  all  elementary  events,  and 
the  impossible  event  is  the  set  with  no  elementary  events. 

The  above  concepts  of  “events”  form  the  basis  of  a  definition  of  “probability”. 
Once  we  understand  the  concept  of  probability,  we  can  develop  a  framework  to  make 
conclusions  about  the  population  of  interest,  using  a  sample  of  data. 

Example  6.1.1  ( Rolling  a  die )  If  a  die  is  rolled  once,  then  the  possible  outcomes 
are  the  number  of  dots  on  the  upper  surface:  1,  2,  . . . ,  6.  Therefore,  the  sample 
space  is  the  set  of  simple  events  =  “1”,  uj2  =  “2”,  . . . ,  uj c  =  “6”  and  Q  = 
{00 \ ,  cc2, . . . ,  cc6}.  Any  subset  of  Q  can  be  used  to  define  an  event.  For  example,  an 
event  A  may  be  “an  even  number  of  dots  on  the  upper  surface  of  the  die”.  There  are 
three  possibilities  that  this  event  occurs:  uq,  CC4,  or  CC6.  If  an  odd  number  shows  up, 
then  the  composite  event  A  occurs  instead  of  A.  If  an  event  is  defined  to  observe  only 
one  particular  number,  say  uj\  =  “1”,  then  it  is  an  elementary  event.  An  example  of 
a  sure  event  is  “a  number  which  is  greater  than  or  equal  to  1”  because  any  number 
between  1  and  6  is  greater  than  or  equal  to  1.  An  impossible  event  is  “the  number  is 
7”. 


Example  6.1.2  ( Rolling  two  dice )  Suppose  we  throw  two  dice  simultaneously  and 
an  event  is  defined  as  the  “number  of  dots  observed  on  the  upper  surface  of  both  the 
dice”;  then,  there  are  36  simple  events  defined  as  (number  of  dots  on  first  die,  number 
of  dots  on  second  die),  i.e.  U)\  =  (1,  1),  U2  =  (1,  2), . . . ,  tc36  =  (6,  6).  Therefore  £2 
is 


{(1,1),  (1,2),  (1,3),  (1,4),  (1,5),  (1,6) 

(2.1) ,  (2,2),  (2,3),  (2,4),  (2,5),  (2,6) 

(3.1) ,  (3,2),  (3,3),  (3,4),  (3,5),  (3,6) 

(4.1) ,  (4,2),  (4,3),  (4,4),  (4,5),  (4,6) 

(5.1) ,  (5,2),  (5,3),  (5,4),  (5,5),  (5,6) 

(6.1) ,  (6,2),  (6,3),  (6,4),  (6,5),  (6,6)}. 


One  can  define  different  events  and  their  corresponding  sample  spaces.  For  exam¬ 
ple,  if  an  event  A  is  defined  as  “upper  faces  of  both  the  dice  contain  the  same  number 
of  dots”,  then  the  sample  space  is  £2  a  =  {(1,  1),  (2,  2),  (3,  3),  (4, 4),  (5,  5),  (6,  6)}. 
If  another  event  B  is  defined  as  “the  sum  of  numbers  on  the  upper  faces  is  6”,  then 
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Fig.  6.1  AU  B  and  AH  B* 


Fig. 6.2  A\B  and 

A  =  I2\A* 


A 

the  sample  space  is  £2b  =  {(1,  5),  (2,  4),  (3,  3),  (4,  2),  (5,  1)}.  A  sure  event  is  “get 
either  an  even  number  or  an  odd  number”;  an  impossible  event  would  be  “the  sum 
of  the  two  dice  is  greater  than  13”. 

It  is  possible  to  view  events  as  sets  of  simple  events.  This  helps  to  determine  how 
different  events  relate  to  each  other.  A  popular  technique  to  visualize  this  approach  is 
to  use  Venn  diagrams.  In  Venn  diagrams,  two  or  more  sets  are  visualized  by  circles. 
Overlapping  circles  imply  that  both  events  have  one  or  more  identical  simple  events. 
Separated  circles  mean  that  none  of  the  simple  events  of  event  A  are  contained  in 
the  sample  space  of  B.  We  use  the  following  notations: 

AU  B  The  union  of  events  A  U  B  is  the  set  of  all  simple  events  of  A  and  B  which 
occurs  if  at  least  one  of  the  simple  events  of  A  or  B  occurs  (Fig.  6.1,  left 
side,  grey  shaded  area).  Please  note  that  we  use  the  word  “or”  from  a 
statistical  perspective:  “A  or  B ”  means  that  either  a  simple  event  from  A 
occurs,  or  a  simple  event  from  B  occurs,  or  a  simple  event  which  is  part 
of  both  A  and  B  occurs. 

A  H  B  The  intersection  of  events  A  Pi  B  is  the  set  of  all  simple  events  A  and  B 
which  occur  when  a  simple  event  occurs  that  belongs  to  A  and  B  (Fig.  6.1, 
right  side,  grey  shaded  area). 

A\B  The  event  A\B  contains  all  simple  events  of  A,  which  are  not  contained 
in  B.  The  event  “A  but  not  B ”  or  “A  minus  Z?”  occurs,  if  A  occurs  but  B 
does  not  occur.  Also  A\B  =  A  Fi  B  (Fig.  6.2,  left  side,  grey  shaded  area). 
A  The  event  A  contains  all  simple  events  of  Q,  which  are  not  contained  in  A. 

The  complementary  event  of  A  (which  is  “Not- A”  or  “A”  occurs  whenever 
A  does  not  occur  (Fig.  6.2,  right  side,  grey  shaded  area). 

A  c  B  A  is  a  subset  of  B.  This  means  that  all  simple  events  of  A  are  also  part  of 
the  sample  space  of  B. 
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Example  6.1.3  Consider  Example  6.1.1  where  the  sample  space  of  rolling  a  die  was 
determined  as  £2  =  {cei,  002,  . . . ,  ^6}  with  uj\  =  “1”,  002  =  “2”,  . . . ,  =  “6”. 

•  If  A  =  {cji,  CJ2,  CJ3,  CJ4,  CJ5}  and  5  is  the  set  of  all  odd  numbers,  then  5  = 
{cji,  CJ3,  CJ5}  and  thus  Z?  c  A. 

•  If  A  =  {cj 2,  04,  Cc^}  is  the  set  of  even  numbers  and  B  =  {44 ,  04}  is  the  set  of  all 
numbers  which  are  divisible  by  3,  then  A  U  B  =  [u 2,  00 3,  04,  04}  is  the  collection 
of  simple  events  for  which  the  number  is  either  even  or  divisible  by  3  or  both. 

•  If  A  =  {c^i,  04,  04}  is  the  set  of  odd  numbers  and  B  =  {00 3,  04}  is  the  set  of  the 
numbers  which  are  divisible  by  3,  then  A  Pi  B  =  {44}  is  the  set  of  simple  events 
in  which  the  numbers  are  odd  and  divisible  by  3. 

•  If  A  =  {ol)\,  04,  04}  is  the  set  of  odd  numbers  and  B  =  {0^3 ,  04}  is  the  set  of  the 
numbers  which  are  divisible  by  3,  then  A\B  =  {c^i ,  0^5 }  is  the  set  of  simple  events 
in  which  the  numbers  are  odd  but  not  divisible  by  3. 

•  If  A  =  {04,  ^4,  04}  is  the  set  of  even  numbers,  then  A  =  {uo\ ,  04,  ^5 }  is  the  set  of 
odd  numbers. 

Remark  6.1.1  Some  textbooks  also  use  the  following  notations: 

A  +  B  for  A  U  B 
AB  for  A  fl  B 
A-  B  for  A\B. 

We  can  use  these  definitions  and  notations  to  derive  the  following  properties  of  a 
particular  event  A: 

AU  A  =  A  AH  A  =  A 

A  U  Q  =  Q  AHk2  =  A 

AU0  =  A  A  n  0  =  0 

AU  A  =  ^  A  D  A  =  0. 

Definition  6.1.1  Two  events  A  and  B  are  disjoint  if  A  n  B  =  0  holds,  i.e.  if  both 
events  cannot  occur  simultaneously. 

Example  6.1.4  The  events  A  and  A  are  disjoint  events. 

Definition  6.1.2  The  events  A\,  A2, . . . ,  Am  are  said  to  be  mutually  or  pairwise 
disjoint,  if  A/  D  Aj  =  0  whenever  i  7^  j  =  1,  2,  ...,  m. 

Example  6.1.5  Recall  Example  6.1.1.  If  A  =  {uj\,  002 ,  ^5}  and  B  =  {04,  ^4,  ^6}  are 
the  sets  of  odd  and  even  numbers,  respectively,  then  the  events  A  and  B  are  disjoint. 
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Definition  6.1.3  The  events  A\ ,  A2,  . . . ,  Am  form  a  complete  decomposition  of  Q 
if  and  only  if 

Ai  U  A2  U  •  •  •  U  Am  =  Q 


and 


At  Pi  Aj  =  0  (for  all  i  /  j). 


Example  6.1.6  Consider  Example  6.1.1.  The  elementary  events  A\  =  A 2  = 

{^2},  . . . ,  As  =  {c^6}  form  a  complete  decomposition.  Other  complete  decomposi¬ 
tions  are,  e.g. 

•  A\  =  {iU\ ,  CC3 ,  CJ5},  A2  =  {UJ2,  V4,  ^6} 

•  Ai  =  {o;i},  A2  =  {id 2,  CJ3,  CJ4,  CJ5,  6^6} 

•  A\  =  {CJ1,  CJ2,  CJ3},  A2  =  {id 4 ,  CJ5,  4^6 } . 


6.2  Relative  Frequency  and  Laplace  Probability 

There  is  a  close  connection  between  the  relative  frequency  and  the  probability  of 
an  event.  A  random  experiment  is  described  by  its  possible  outcomes,  for  example 
getting  a  number  between  1  and  6  when  rolling  a  die.  Suppose  an  experiment  has  m 
possible  outcomes  (events)  A\ ,  A2,  . . . ,  Am  and  the  experiment  is  repeated  n  times. 
Now  we  can  count  how  many  times  each  of  the  possible  outcome  has  occurred.  In 
other  words,  we  can  calculate  the  absolute  frequency  nt  =  n(Ai)  which  is  equal  to 
the  number  of  times  an  event  A/,  i  =  1,2 ,  ...  ,m,  occurs.  The  relative  frequency 
fi  =  /(A  / )  of  a  random  event  A/ ,  with  n  repetitions  of  the  experiment,  is  calculated 
as 

fi  —  /  (Ai)  =  — .  (6.1) 

n 

Example  6.2.1  Consider  roulette,  a  game  frequently  played  in  casinos.  The  roulette 
table  consists  of  37  numbers  from  0  to  36.  Out  of  these  37  numbers,  18  numbers  are 
red,  1 8  are  black  and  one  (zero)  is  green.  Players  can  place  their  bets  on  either  a  single 
number  or  a  range  of  numbers,  the  colours  red  or  black,  whether  the  number  is  odd 
or  even,  among  many  other  choices.  A  casino  employee  spins  a  wheel  (containing 
pockets  representing  the  37  numbers)  in  one  direction  and  then  spins  a  ball  over  the 
wheel  in  the  opposite  direction.  The  wheel  and  ball  gradually  slow  down  and  the 
ball  finally  settles  in  a  pocket.  The  pocket  number  in  which  the  ball  sits  down  when 
the  wheel  stops  is  the  winning  number.  Consider  three  possible  outcomes  A\ :  “red”, 
A2: “black”,  and  A3 :  “green  (zero)”.  Suppose  the  roulette  ball  is  spun  n  =  500  times. 
All  the  outcomes  are  counted  and  recorded  as  follows:  A\  occurs  240  times,  A2 
occurs  250  times  and  A3  occurs  10  times.  Then,  the  absolute  frequencies  are  given 
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by  n\  =  n(A\)  =  240,  222  =  22  (A2)  =  250,  and  223  =  22 (A3)  =  10.  We  therefore  get 
the  relative  frequencies  as 


/1  =  /(A  1) 

/3  =  /(As) 


240 

500 

10 

500 


0.48,  fi  =  f  (A2) 


250 

500 


0.02. 


If  we  assume  that  the  experiment  is  repeated  a  large  number  of  times  (mathe¬ 
matically,  this  would  mean  that  22  tends  to  infinity)  and  the  experimental  conditions 
remain  the  same  (at  least  approximately)  over  all  the  repetitions,  then  the  relative 
frequency  /(A)  converges  to  a  limiting  value  for  A.  This  limiting  value  is  interpreted 
as  the  probability  of  A  and  denoted  by  P(A),  i.e. 


P  ( A )  =  lim 


22(A) 


n- 


•00 


22 


where  22(A)  denotes  the  number  of  times  an  event  A  occurs  out  of  22  times. 


Example  6.2.2  Suppose  a  fair  coin  is  tossed  22  =  20  times  and  we  observe  the  number 
of  heads  22 (A  1)  =  8  times  and  number  of  tails  22 (A2)  =  12  times.  The  meaning 
of  a  fair  coin  in  this  case  is  that  the  probabilities  of  head  and  tail  are  equal  (i.e. 
0.5).  Then,  the  relative  frequencies  in  the  experiment  are  f(A\)  =  8/20  =  0.4  and 
/(A2)  =  12/20  =  0.6.  When  the  coin  is  tossed  a  large  number  of  times  and  22  tends 
to  infinity,  then  both  f(A\)  and  /(A2)  will  have  a  limiting  value  0.5  which  is  simply 
the  probability  of  getting  a  head  or  tail  in  tossing  a  fair  coin. 

Example  6.2.3  In  Example  6.2.1,  the  relative  frequency  of  /(red)  =  f(A\)  tends  to 
18/37  as  22  tends  to  infinity  because  18  out  of  37  numbers  are  red. 


The  reader  will  gain  a  more  theoretical  understanding  of  how  repeated  experiments 
relate  to  expected  quantities  in  the  following  chapters  after  learning  the  Theorem  of 
Large  Numbers  described  in  Appendix  A. 3. 

A  different  definition  of  probability  was  given  by  Pierre-Simon  Laplace  (1749- 
1827).  We  call  an  experiment  a  Laplace  experiment  if  the  number  of  possible 
simple  events  is  finite  and  all  the  outcomes  are  equally  probable.  The  probability  of 
an  arbitrary  event  A  is  then  defined  as  follows: 


Definition  6.2.1  The  proportion 


P(A)  = 


|A[ 

1X21 


Number  of  “favourable  simple  events”  for  A 
Total  number  of  possible  simple  events 


(6.2) 


is  called  the  Laplace  probability,  where  |  A  |  is  the  cardinal  number  of  A,  i.e.  the 
number  of  simple  events  contained  in  the  set  A,  and  |X2|  is  the  cardinal  number  of 
X2,  i.e.  the  number  of  simple  events  contained  in  the  set  X2. 


The  cardinal  numbers  |  A  |  and  \Q\  are  often  calculated  using  the  combinatoric 
rules  introduced  in  Chap.  5. 
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Example  6.2.4  ( Example  6.1.2  continued )  The  sample  space  contains  36  simple 
events.  All  of  these  simple  events  have  equal  probability  1/36.  To  calculate  the 
probability  of  the  event  A  that  the  sum  of  the  dots  on  the  two  dice  is  at  least  4  and  at 
most  6,  we  count  the  favourable  simple  events  which  fulfil  this  condition.  The  simple 
events  are  (1,  3),  (2,  2),  (3,  1)  (sum  is  4),  (1,  4),  (2,  3),  (4,  1),  (3,  2)  (sum  is  5)  and 
(1,  5),  (2,  4),  (3,  3),  (4,  2),  (5,  1)  (sum  is  6).  In  total,  there  are  (3  +  4  +  5)  =  12 
favourable  simple  events,  i.e. 

A  =  {(1,3),  (2,  2),  (3,1),  (1,4),  (2,  3),  (4,  1), 

(3,  2),  (1,5),  (2,  4),  (3,  3),  (4,  2),  (5,1)}  . 

The  probability  of  the  event  A  is  therefore  12/36  =  1/3. 


6.3  The  Axiomatic  Definition  of  Probability 

An  important  foundation  for  modern  probability  theory  was  established  by 
A.N.  Kolmogorov  in  1933  when  he  proposed  the  following  axioms  of  probability. 

Axiom  1  Every  random  event  A  has  a  probability  in  the  (closed)  interval  [0,  1],  i.e. 

0  <  P(A)  <  1. 

Axiom  2  The  sure  event  has  probability  1,  i.e. 

P(Q)  =  1. 


Axiom  3  If  A\  and  A 2  are  disjoint  events,  then 

E(AiUA2)  =  E(Ai)  +  P(A2). 


holds. 


Remark  Axiom  3  also  holds  for  three  or  more  disjoint  events  and  is  called  the 
theorem  of  additivity  of  disjoint  events.  For  example,  if  A\ ,  A2,  and  A3  are  disjoint 
events,  then  P(A\  UA2U  A3)  =  E(Ai)  +  P(A2)  +  P(A2). 

Example  6.3. 1  Suppose  the  two  events  in  tossing  a  coin  are  A 1 :  “appearance  of  head” 
and  A2:  “appearance  of  tail”  which  are  disjoint.  The  event  A 1  U  A2:  “appearance  of 
head  or  tail”  has  the  probability 

P(A\  U  A2)  =  P(Ai)  +  P(A2)  =  1/2  +  1/2  =  1. 

Example  6.3.2  Suppose  an  event  is  defined  as  the  number  of  points  observed  on  the 
upper  surface  of  a  die  when  rolling  it.  There  are  six  events,  i.e.  the  natural  numbers 
1,  2,  3, 4,  5,  6.  These  events  are  disjoint  and  they  have  equal  probability  of  occurring: 
P(l)  =  P( 2)  =  •  •  •  =  P(6)  =  1/6.  The  probability  of  getting  an  even  number  is 
then 


E(“even  number”)  =  P(2)  +  E(4)  +  E(6)  =  1/6  +  1/6  +  1/6  =  1/2. 
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6.3.1  Corollaries  Following  from  Kolomogorov's  Axioms 

We  already  know  that  A  U  A  =  £2  (sure  event).  Since  A  and  A  are  disjoint,  using 
Axiom  3  we  have 

P(A  U  A)  =  P(A)  +  P(A)  =  1 . 

Based  on  this,  we  have  the  following  corollaries. 

Corollary  1  The  probability  of  the  complementary  event  of  A,  (i.e.  A)  is 

P(A)  =  1  -  P(A).  (6.3) 

Example  6.3.3  Suppose  a  box  of  30  chocolates  contains  chocolates  of  6  differ¬ 
ent  flavours  with  5  chocolates  of  each  flavour.  Suppose  an  event  A  is  defined  as 
A  =  {“marzipan  flavour”}.  The  probability  of  finding  a  marzipan  chocolate  (with¬ 
out  looking  into  the  box)  is  P (“marzipan”)  =  5/30.  Then,  the  probability  of  the 
complementary  event  A,  i.e.  the  probability  of  not  finding  a  marzipan  chocolate  is 
therefore 

P(“no  marzipan  flavour”)  =  1  —  P (“marzipan  flavour”)  =  25/30. 

Corollary  2  The  probability  of  occurrence  of  an  impossible  event  0  is  zero: 

P(0)  =  P(T2)  =  1  -  P(Q)  =  0. 


Corollary  3  Let  A\  and  A2  be  not  necessarily  disjoint  events.  The  probability  of 
occurrence  of  A\  or  A 2  is 

P(Ai  U  A2)  =  P(Ai)  +  P(A2)  -  P(Ai  H  A2).  (6.4) 

The  rule  in  (6.4)  is  known  as  the  additive  theorem  of  probability.  Again  we  use 
the  word  “or”  in  the  statistical  sense:  either  A\  is  occurring,  A2  is  occurring,  or  both 
of  them.  This  means  we  have  to  add  the  probabilities  P(A\)  and  P(A2)  but  need  to 
make  sure  that  the  simple  events  which  are  contained  in  both  sets  are  not  counted 
twice,  thus  we  subtract  P(A  \  n  A2). 

Example  6.3.4  There  are  10  actors  acting  in  a  play.  Two  actors,  one  of  whom  is 
male,  are  portraying  evil  characters.  In  total,  there  are  6  female  actors.  Let  an  event 
A  describe  whether  the  actor  is  male  and  another  event  B  describe  whether  the 
character  is  evil.  Suppose  we  want  to  know  the  probability  of  a  randomly  chosen 
actor  being  male  or  evil.  We  can  then  calculate 

P  (actor  is  male  or  evil)  = 

=  P  (actor  is  male)  +  P  (actor  is  evil)  —  P  (actor  is  male  and  evil) 

4  2  11 
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Corollary  4  If  A  c  B  then  P(A )  <  P(B). 


Proof  We  use  the  representation  B  =  AU  (AH  B)  where  A  and  A  P\  B  are  the 
disjoint  events.  Then  using  Axiom  3  and  Axiom  1,  we  get 

P(B)  =  P(A)  +  P(A  HB)>  P(A) . 


6.3.2  Calculation  Rules  for  Probabilities 

The  introduced  axioms  and  corollaries  can  be  summarized  as  follows: 

(1)  0  <  P(A)  <  1 

(2)  P(£2)  =  1 

(3)  P(A\  U  A2)  =  P(A\)  +  P(A2),  if  A\  and  A2  are  disjoint 

(4)  P(0)  =  0 

(5)  P(A)  =  1  -  P(A) 

(6)  P(Ai  U  A2)  =  P(Ai)  +  P(A2)  ~  P(Ai  H  A2) 

(7)  P(A)  <  P(5),if  A  c  B 


6.4  Conditional  Probability 

Consider  the  following  example  to  understand  the  concept  of  conditional  probability: 
Suppose  a  new  medical  test  is  developed  to  diagnose  a  particular  infection  of  the 
blood.  The  test  is  conducted  on  blood  samples  from  100  randomly  selected  patients 
and  the  outcomes  of  the  tests  are  presented  in  Table  6.1. 

There  are  the  following  four  possible  outcomes: 

•  The  blood  sample  has  an  infection  and  the  test  diagnoses  it,  i.e.  the  test  is  correctly 
diagnosing  the  infection. 

•  The  blood  sample  does  not  have  an  infection  and  the  test  does  not  diagnose  it,  i.e. 
the  test  is  correctly  diagnosing  that  there  is  no  infection. 

•  The  blood  sample  has  an  infection  and  the  test  does  not  diagnose  it,  i.e.  the  test  is 
incorrect  in  stating  that  there  is  no  infection. 

•  The  blood  sample  does  not  have  an  infection  but  the  test  diagnoses  it,  i.e.  the  test 
is  incorrect  in  stating  that  there  is  an  infection. 

Table  6.2  contains  the  relative  frequencies  of  Table  6.1.  In  the  following,  we  in¬ 
terpret  the  relative  frequencies  as  probabilities,  i.e.  we  assume  that  the  values  in 
Table  6.2  would  be  observed  if  the  number  n  of  patients  was  much  larger  than  100. 

It  can  be  seen  that  the  probability  that  a  test  is  positive  is  P(T+)  =  0.30 +  0.10  = 
0.40  and  the  probability  that  an  infection  is  present  is  P(IP)  =  0.30  +  0.15  =  0.45. 


118 


6  Elements  of  Probability  Theory 


Table  6.1  Absolute  frequencies  of  test  results  and  infection  status 


Infection 

Total  (row) 

Present 

Absent 

Test 

Positive  (+) 

30 

10 

40 

Negative  (— ) 

15 

45 

60 

Total  (column) 

45 

55 

Total  =100 

Table  6.2  Relative  frequencies  of  patients  and  test 


Infection 

Total  (row) 

Present  (IP) 

Absent  (IA) 

Test 

Positive  (+) 

0.30 

0.10 

0.40 

Negative  (— ) 

0.15 

0.45 

0.60 

Total  (column) 

0.45 

0.55 

Total  =  1 

If  one  already  knows  that  the  test  is  positive  and  wants  to  determine  the  probability 
that  the  infection  is  indeed  present,  then  this  can  be  achieved  by  the  respective 

conditional  probability  P(IP\T+)  which  is 


P(IP\T+)  = 


p(ip  n  r+) 

P(T+) 


0.3 

04 


0.75. 


Note  that  I P  PiT+  denotes  the  “relative  frequency  of  blood  samples  in  which  the 
disease  is  present  and  the  test  is  positive”  which  is  0.3. 

More  generally,  recall  Definition  4.1.1  from  Chap.  4  where  we  defined  condi¬ 
tional,  joint,  and  marginal  frequency  distributions  in  contingency  tables.  The  present 
example  simply  applies  these  rules  to  the  contingency  tables  of  relative  frequencies 
and  interprets  the  relative  frequencies  as  an  approximation  to  the  probabilities  of 
interest,  as  already  explained. 

We  use  the  intersection  operator  D  to  describe  events  which  occur  for  A  =  a 
and  B  =  b.  This  relates  to  the  joint  relative  frequencies.  The  marginal  relative  fre¬ 
quencies  (i.e.  probabilities  P(A  =  a))  can  be  observed  from  the  column  and  row 
sums,  respectively;  and  the  conditional  probabilities  can  be  observed  as  the  joint 
frequencies  in  relation  to  the  marginal  frequencies. 

For  simplicity,  assume  that  all  simple  events  in  £2  =  {u)\  ,  u>2,  . . . ,  ujk]  are  equally 
probable,  i.e.  P  (ujj  )  =  j  =  1,2 ,...,&.  Let  A  and  B  be  two  events  containing  n\ 
and  n  b  numbers  of  simple  events.  Let  further  AHS  contain  nAB  numbers  of  simple 
events.  The  Laplace  probability  using  (6.2)  is 


P(A)  = 


PA_ 

k  ’ 


P(B)  = 


ns 

T’ 


P(A  n  B) 


nAB 

k 


Assume  that  we  have  prior  information  that  A  has  already  occurred.  Now  we  want  to 
find  out  how  the  probability  of  B  is  to  be  calculated.  Since  A  has  already  occurred, 
we  know  that  the  sample  space  is  reduced  by  the  number  of  simple  events  which 
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are  contained  in  A.  There  are  nA  such  simple  events.  Thus,  the  total  sample  space 
£2  is  reduced  by  the  sample  space  of  A.  Therefore,  the  simple  events  in  A  (1  B  are 
those  simple  events  which  are  realized  when  B  is  realized.  The  Laplace  probability 
for  B  under  the  prior  information  on  A,  or  under  the  condition  that  A  is  known,  is 
therefore 


nAnlk 

P(B\A)  =  ——y-r 
nA/k 


p(A  n  B) 
P(A) 


(6.5) 


This  can  be  generalized  to  the  case  when  the  probabilities  for  simple  events  are 
unequal. 


Definition  6.4.1  Let  P(A)  >  0.  Then  the  conditional  probability  of  event  B  oc¬ 
curring,  given  that  event  A  has  already  occurred,  is 


P(B\A)  = 


P(A  n  B) 
P(A) 


(6.6) 


The  roles  of  A  and  B  can  be  interchanged  to  define  P  (A  |  B)  as  follows.  Let  P  (B)  >  0. 
The  conditional  probability  of  A  given  B  is 


P(A\B)  = 


p(a  n  B) 
P(B) 


(6.7) 


We  now  introduce  a  few  important  theorems  which  are  relevant  to  calculating 
conditional  and  other  probabilities. 


Theorem  6.4.1  (Multiplication  Theorem  of  Probability)  For  two  arbitrary  events  A 
and  B,  the  following  holds: 

P(A  HB)  =  P(A\B)P(B )  =  P(B\A)P(A).  (6.8) 


This  theorem  follows  directly  from  the  two  definitions  (6.6)  and  (6.7)  (but  does  not 
require  that  P(A)  >  0  and  P(B)  >  0). 


Theorem  6.4.2  (Law  of  Total  Probability)  Assume  that  A\ ,  A2,  . . . ,  Am  are  events 
such  that  U?=1Ai  =  F2  and  A/  Pi  Aj  =  0  for  all  i  /  j,  P(Aj)  >  0  for  all  i,  i.e.  A\, 
A 2,  . . . ,  Am  form  a  complete  decomposition  of  Q  =  U ”L1  A/  in  pairwise  disjoint 
events,  then  the  probability  of  an  event  B  can  be  calculated  as 

m 

P(B)  =  J^P(B\Ai)P(Ai). 

i= 1 


(6.9) 
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6.4.1  Bayes"  Theorem 

Bayes’  Theorem  gives  a  connection  between  P  (A  \  B)  and  P (B  \  A) .  For  events  A  and 
B  with  P(A)  >  0  and  P(B)  >  0,  using  (6.6)  and  (6.7)  or  (6.8),  we  get 


P(A\B)  = 


P(A  n  B)  _  P(A  n  B)  P(A) 
P(B)  P(A)  P(B) 

P(B\A)P(A) 

P(B) 


(6.10) 


Let  Ai,  A 2,  •  •  • ,  Am  be  events  such  that  U"!=1  A/  =  Q  and  A/  Pi  Aj  =  0  for  all  i  ^ 
j,  P(Ai )  >  0  for  all  i,  and  B  is  another  event  than  A,  then  using  (6.9)  and  (6.10), 
we  get 


P(Aj\B) 


P(B\Aj)P(Aj) 

Si  P(B\Ai)P(Ai)  ' 


(6.11) 


The  probabilities  P(A;)  are  called  prior  probabilities,  P(B\Af )  are  sometimes 
called  model  probabilities  and  P(A;  |B)  are  called  posterior  probabilities. 


Example  6.4.1  Suppose  someone  rents  movies  from  two  different  DVD  stores. 
Sometimes  it  happens  that  the  DVD  does  not  work  because  of  scratches.  We  con¬ 
sider  the  following  events:  A/  (i  =  1,2):  “the  DVD  is  rented  from  store  Further 
let  B  denote  the  event  that  the  DVD  is  working  without  any  problems.  Assume  we 
know  that  P(A\)  =  0.6  and  P(A2)  =  0.4  (note  that  A 2  =  Ai)  and  P(B\A\)  =0.95, 
P(Z?|A2)  =  0.75  and  we  are  interested  in  the  probability  that  a  rented  DVD  works 
fine.  We  can  then  apply  the  Law  of  Total  Probability  and  get 

P(B)  (=9)  PiBlA^PiAx)  +  P(B\A2)P(A2 ) 

=  0.6 -0.95 +  0.4 -0.75  =  0.87. 

We  may  also  be  interested  in  the  probability  that  the  movie  was  rented  from  store  1 
and  is  working  which  is 

P(B  n  Ai)  (=8)  P(B\A{)P(Ai)  =  0.95  •  0.6  =  0.57. 


Now  suppose  we  have  a  properly  working  DVD.  What  is  the  probability  that  it  is 
rented  from  store  1?  This  is  obtained  as  follows: 


P(Ai\B) 


(6/7)  P(A i  n  B) 
P(B ) 


0.57 

087 


0.6552. 


Now  assume  we  have  a  DVD  which  does  not  work,  i.e.  B  occurs.  The  probability 
that  a  DVD  is  not  working  given  that  it  is  from  store  1  is  P(B  \  A\)  =  0.05.  Similarly, 
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P(B\A2)  =  0.25  for  store  2.  We  can  now  calculate  the  conditional  probability  that 
a  DVD  is  from  store  1  given  that  it  is  not  working: 


P(M\B) 


(6.11) 


P(B\A\)P(A\) 

P(B\A1)P(A1)  +  P(B\A2)P(A2) 


0.05  •  0.6 

0.05  •  0.6  +  0.25  •  0.4 


0.2308. 


The  result  about  P(B)  used  in  the  denominator  can  also  be  directly  obtained  by  using 
P(B)  =  1-0.87  =  0.13. 


6.5  Independence 

Intuitively,  two  events  are  independent  if  the  occurrence  or  non-occurrence  of  one 
event  does  not  affect  the  occurrence  or  non-occurrence  of  the  other  event.  In  other 
words,  two  events  A  and  B  are  independent  if  the  probability  of  occurrence  of  B  has 
no  effect  on  the  probability  of  occurrence  of  A.  In  such  a  situation,  one  expects  that 

P(A\B)  =  P(A )  and  P(A\B)  =  P(A)  . 

Using  this  and  (6.7),  we  can  write 

P(A\B)~- 


This  yields: 

P(A  H  B)P(B)  =  P(A  H  B)P(B) 

P(A  H  £)(1  -  P(B))  =  P(A  H  B)P(B) 

P(A  HB)  =  ( P(A  HS)  +  P(A  H  B))P(B) 

P(A  n  B)  =  P(A)P(B) .  (6.13) 

This  leads  to  the  following  definition  of  stochastic  independence. 

Definition  6.5.1  Two  random  events  A  and  B  are  called  (stochastically)  indepen¬ 
dent  if 

P(A  HB)  =  P(A)P(B)  ,  (6.14) 

i.e.  if  the  probability  of  simultaneous  occurrence  of  both  events  A  and  B  is  the 
product  of  the  individual  probabilities  of  occurrence  of  A  and  B. 


P(A  H  B) 


P(B ) 

P(AHB) 

-A — =  P(A\B).  (6.12) 

P(B) 
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This  definition  of  independence  can  be  extended  to  the  case  of  more  than  two 
events  as  follows: 

Definition  6.5.2  The  n  events  A\,  A2,  . . . ,  An  are  stochastically  mutually  indepen¬ 
dent,  if  for  any  subset  of  m  events  A^ ,  A/2,  . . . ,  A;m  ( m  <  n) 

P(Ah  n  Ai2  •  •  •  n  A/w)  =  P(Ah)P(Ai2)  .....  P(A/m)  (6.15) 

holds. 

A  weaker  form  of  independence  is  pairwise  independence.  If  condition  (6.15)  is 
fulfilled  only  for  two  arbitrary  events,  i.e.  m  =  2,  then  the  events  are  called  pairwise 
independent.  The  difference  between  pairwise  independence  and  general  stochastic 
independence  is  explained  in  the  following  example. 


Example  6.5.1  Consider  an  urn  with  four  balls.  The  following  combinations  of  ze¬ 
roes  and  ones  are  printed  on  the  balls:  110,  101,  Oil,  000.  One  ball  is  drawn  from 
the  urn.  Define  the  following  events: 

A 1  :  The  first  digit  on  the  ball  is  1 . 

A 2  :  The  second  digit  on  the  ball  is  1 . 

A3  :  The  third  digit  on  the  ball  is  1. 


Since  there  are  two  favourable  simple  events  for  each  of  the  events  A\,  A2  and  A3, 
we  get 


P(Al)  =  P(A2)  =  P(A3)  =  ^  =  ^. 


The  probability  that  all  the  three  events  simultaneously  occur  is  zero  because  there 
is  no  ball  with  111  printed  on  it.  Therefore,  A 1,  A 2,  and  A3  are  not  stochastically 
independent  because 


p (A\) p (a2) p (a3)  =  -  7^  0  =  P(A\  n  a2  n  A3). 

o 

However, 

P(Ai  HA2)  =  1  =  />(A1)JP(A2), 

p(A\  n  A3)  =  1  =  />(Ai)/,(A3), 

P(A2  n  a3)  =  1  =  P(A2)P(A3) , 
which  means  that  the  three  events  are  pairwise  independent. 
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6.6  Key  Points  and  Further  Issues 


Note: 


/  We  summarize  some  important  theorems  and  laws: 


•  The  Laplace  probability  is  the  ratio 


P(A)  = 


|A| 

\Q\ 


Number  of  “favourable  simple  events”  for  A 
Total  number  of  possible  simple  events 


•  The  Law  of  Total  Probability  is 


m 

P(B)  =  YJP(B\Ai)P{Ai). 

i  =  \ 


•  Bayes’  Theorem  is 


P(Aj\B) 


P(B\Aj)P(Aj) 

'ZiP(B\Ai)P(Aiy 


•  n  events  A\,  A2,  . . . ,  An  are  (stochastically)  independent,  if 


P(Ai  n  a2  •  •  •  n  An)  =  p(AOP(A2)  • . . .  •  P(An). 


/  In  Sect.  10.8,  we  present  the  x2 -independence  test,  which  can  test 
whether  discrete  random  variables  (see  Chap.  7)  are  independent  or  not. 


6.7  Exercises 

Exercise  6. 1 

(a)  Suppose  Q  =  {0,  1, ...,  15},  A  =  {0,  8},  B  =  {1,2,  3,5,  8,  10,  12},  C  =  {0,4, 
9,  15}.  Determine  A  P\  B,  B  P\  C,  AUC,  C  \  A,  Q  \  (B  U  AU  C). 

(b)  Now  consider  the  three  pairwise  disjoint  events  E ,  F,  G  with  £2  =  E  U 
F  U  G  and  P(E )  =  0.2  and  P(F)  =  0.5.  Calculate  P(F),  P(G),  P(E  n  G), 
P(E  \E),  and  P(EU  F). 

Exercise  6.2  A  driving  licence  examination  consists  of  two  parts  which  are  based  on 
a  theoretical  and  a  practical  examination.  Suppose  25  %  of  people  fail  the  practical 
examination,  15  %  of  people  fail  the  theoretical  examination,  and  10  %  of  people  fail 
both  the  examinations.  If  a  person  is  randomly  chosen,  then  what  is  the  probability 
that  this  person 
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(a)  fails  at  least  one  of  the  examinations? 

(b)  only  fails  the  practical  examination,  but  not  the  theoretical  examination? 

(c)  successfully  passes  both  the  tests? 

(d)  fails  any  of  the  two  examinations? 

Exercise  6.3  A  new  board  game  uses  a  twelve-sided  die.  Suppose  the  die  is  rolled 
once,  what  is  the  probability  of  getting 

(a)  an  even  number? 

(b)  a  number  greater  than  9? 

(c)  an  even  number  greater  than  9? 

(d)  an  even  number  or  a  number  greater  than  9? 

Exercise  6.4  The  Smiths  are  a  family  of  six.  They  are  celebrating  Christmas  and 
there  are  12  gifts,  two  for  each  family  member.  The  name  tags  for  each  family 
member  have  been  attached  to  the  gifts.  Unfortunately  the  name  tags  on  the  gifts  are 
damaged  by  water.  Suppose  each  family  member  draws  two  gifts  at  random.  What 
is  the  probability  that  someone 

(a)  gets  his/her  two  gifts,  rather  than  getting  the  gifts  for  another  family  member? 

(b)  gets  none  of  his/her  gifts,  but  rather  gets  the  gifts  for  other  family  members? 

Exercise  6.5  A  chef  from  a  popular  TV  cookery  show  sometimes  puts  too  much  salt 
in  his  pumpkin  soup  and  the  probability  of  this  happening  is  0.2.  If  he  is  in  love 
(which  he  is  with  probability  0.3),  then  the  probability  of  using  too  much  salt  is  0.6. 

(a)  Create  a  contingency  table  for  the  probabilities  of  the  two  variables  “in  love” 
and  “too  much  salt”. 

(b)  Determine  whether  the  two  variables  are  stochastically  independent  or  not. 

Exercise  6.6  Dr.  Obermeier  asks  his  neighbour  to  take  care  of  his  basil  plant  while 
he  is  away  on  leave.  He  assumes  that  his  neighbour  does  not  take  care  of  the  basil 
with  a  probability  of  ^ .  The  basil  dies  with  probability  ^  when  someone  takes  care 

Q 

of  it  and  with  probability  |  if  no  one  takes  care  of  it. 

(a)  Calculate  the  probability  of  the  basil  plant  surviving  after  its  owner’s  leave. 

(b)  It  turns  out  that  the  basil  eventually  dies.  What  is  the  probability  that 
Dr.  Obermeier’ s  neighbour  did  not  take  care  of  the  plant? 

Exercise  6.7  A  bank  considers  changing  its  credit  card  policy.  Currently  5  %  of  credit 
card  owners  are  not  able  to  pay  their  bills  in  any  month,  i.e.  they  never  pay  their  bills. 
Among  those  who  are  generally  able  to  pay  their  bills,  there  is  still  a  20  %  probability 
that  the  bill  is  paid  too  late  in  a  particular  month. 
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(a)  What  is  the  probability  that  someone  is  not  paying  his  bill  in  a  particular  month? 

(b)  A  credit  card  owner  did  not  pay  his  bill  in  a  particular  month.  What  is  the 
probability  that  he  never  pays  back  the  money? 

(c)  Should  the  bank  consider  blocking  the  credit  card  if  a  customer  does  not  pay  his 
bill  on  time? 

Exercise  6.8  There  are  epidemics  which  affect  animals  such  as  cows,  pigs,  and  oth¬ 
ers.  Suppose  200  cows  are  tested  to  see  whether  they  are  infected  with  a  virus  or  not. 
Let  event  A  describe  whether  a  cow  has  been  transported  by  a  truck  recently  or  not 
and  let  B  denote  the  event  that  a  cow  has  been  tested  positive  with  a  virus.  The  data 
are  summarized  in  the  following  table: 


B  B 

Total 

A 

40  60 

100 

A 

20  80 

100 

Total 

60  140 

200 

(a)  What  is  the  probability  that  a  cow  is  infected  and  has  been  transported  by  a  truck 
recently? 

(b)  What  is  the  probability  of  having  an  infected  cow  given  that  it  has  been  trans¬ 
ported  by  the  truck? 

(c)  Determine  and  interpret  P(B). 

Exercise  6.9  A  football  practice  target  is  a  portable  wall  with  two  holes  (which  are 
the  target)  in  it  for  training  shots.  Suppose  there  are  two  players  A  and  B.  The 
probabilities  of  hitting  the  target  by  A  and  B  are  0.4  and  0.5,  respectively. 

(a)  What  is  the  probability  that  at  least  one  of  the  players  succeeds  with  his  shot? 

(b)  What  is  the  probability  that  exactly  one  of  the  players  hits  the  target? 

(c)  What  is  the  probability  that  only  B  scores? 


Solutions  to  all  exercises  in  this  chapter  can  be  found  on  p.  361 


"Source  Toutenburg,  H.,  Heumann,  C.,  Induktive  Statistik ,  4th  edition,  2007, 
Springer,  Heidelberg 


Random  Variables 


In  the  first  part  of  the  book  we  highlighted  how  to  describe  data.  Now,  we  discuss 
the  concepts  required  to  draw  statistical  conclusions  from  a  sample  of  data  about  a 
population  of  interest.  For  example,  suppose  we  know  the  starting  salary  of  a  sample 
of  100  students  graduating  in  law.  We  can  use  this  knowledge  to  draw  conclusions 
about  the  expected  salary  for  the  population  of  all  students  graduating  in  law.  Simi¬ 
larly,  if  a  newly  developed  drug  is  given  to  a  sample  of  selected  tuberculosis  patients, 
then  some  patients  may  show  improvement  and  some  patients  may  not,  but  we  are 
interested  in  the  consequences  for  the  entire  population  of  patients.  In  the  remainder 
of  this  chapter,  we  describe  the  theoretical  concepts  required  for  making  such  con¬ 
clusions.  They  form  the  basis  for  statistical  tests  and  inference  which  are  introduced 
in  Chaps.  9-11. 


7.1  Random  Variables 

Random  variables  help  us  to  view  the  collected  data  as  an  outcome  of  a  random 
experiment.  Consider  the  simple  experiment  of  tossing  a  coin.  If  a  coin  is  tossed, 
then  one  can  observe  either  “head”  (//)  or  “tail”  ( T ).  The  occurrence  of  “head”  or 
“tail”  is  random,  and  the  exact  outcome  will  only  be  known  after  the  coin  is  tossed. 
We  can  toss  the  coin  many  times  and  obtain  a  sequence  of  outputs.  For  example,  if 
a  coin  is  tossed  seven  times,  then  one  of  the  outcomes  may  be  H ,  H,  T ,  H ,  T,T,T. 
This  outcome  is  the  consequence  of  a  random  experiment,  and  it  may  be  helpful  if 
we  can  distill  the  sequence  of  outcomes  in  meaningful  numbers.  One  option  is  to 
summarize  them  by  a  variable  X ,  which  takes  the  values  x\  =  1  (denoting  head) 
and  X2  =  0  (denoting  tail).  We  have  learnt  from  Chap.  6  that  this  can  be  described 
in  the  framework  of  a  random  experiment  where  Q  =  {ccq ,  0^2)  with  the  events 
A 1  =  {c^i}  =  1  =  head  and  A2  =  {^2}  =  0  =  tail.  The  random  variable  X  is 
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Table  7.1  Examples  of  random  variables 


X 

Event 

Realizations  of  X 

Roll  of  a  die 

A[ :  number  i  (i  =  1,2, ...  ,6) 

v  =  i 

Lifetime  of  TV 

Aj :  survival  time  is  i  months 

a  =  i,  2, . . .) 

x  =  i 

Roulette 

A i :  red 

X\  =  1 

A2'.  black 

<N 

II 

<N 

X 

A3 :  green  (zero) 

X3  =  0 

now  mapped  to  real  numbers,  and  therefore,  it  describes  the  possible  outcome  of 
any  coin  toss  experiment.  The  observed  outcomes  H ,  H,  T,  H,  T,  T,  T  relate  to  a 
specific  sample,  a  unique  realization  of  this  experiment.  We  can  write  X(ctq)  =  1 
and  X(u2)  =  0  with  uj\,uj2  £  &  and  1,  0  e  1Z  where  1Z  is  the  set  of  real  numbers. 
We  know  that  in  any  coin  tossing  experiment,  the  probability  of  head  being  observed 
is  P{X(<jl)\)  =  1)  =  0.5  and  of  tail  being  observed  is  P(X( cez)  =  0)  =  0.5.  We 
may  therefore  view  X  as  a  random  variable  which  collects  the  possible  outcomes  of 
a  random  experiment  and  captures  the  uncertainty  associated  with  them. 

Definition  7.1.1  Let  E2  represent  the  sample  space  of  a  random  experiment,  and  let 
7 Z  be  the  set  of  real  numbers.  A  random  variable  is  a  function  X  which  assigns  to 
each  element  u  e  £2  one  and  only  one  number  X(u)  =  x,  x  elZ,  i.e. 

X  :  Q  ->  1Z.  (7.1) 

Example  7.1.1  The  features  of  a  die  roll  experiment,  a  roulette  game,  or  the  lifetime 
of  a  TV  can  all  be  described  by  a  random  variable,  see  Table  7.1.  The  events  involve 
randomness,  and  if  we  have  knowledge  about  the  random  process,  we  can  assign 
probabilities  P(X  =  Xi)  to  each  event,  e.g.  when  rolling  a  die,  the  probability  of 
getting  a  “1”  is  P(X  =  1)  =  1/6  and  the  probability  of  getting  a  “2”  is  P(X  =  2)  = 
1/6. 

Note  that  it  is  a  convention  to  denote  random  variables  by  capital  letters  (e.g.  X) 
and  their  values  by  small  letters  (e.g.  x).  It  is  evident  from  the  coin  tossing  experiment 
that  we  need  to  know  P(X  =  x)  to  describe  the  respective  random  variable.  We 
assume  in  this  chapter  that  we  have  this  knowledge.  However,  Chaps.  9-11  show 
how  a  sample  of  data  can  be  used  to  estimate  unknown  probabilities  and  other 
quantities  given  a  prespecified  uncertainty  level.  More  generally,  we  can  say  that  it 
is  mandatory  to  know  P(X  e  A)  for  all  possible  A  which  are  subsets  of  1Z.  If  we 
choose  A  =  (—oo,  x],  x  e  X,  we  have 

P(X  e  A)  =  P(X  E  (-00,  x])  =  P(-o o  <  X  <x)  =  P(X  <  x). 

This  consideration  gives  rise  to  the  definition  of  the  cumulative  distribution  func¬ 
tion.  Recall  that  we  developed  the  concept  of  the  empirical  cumulative  distribution 
function  (ECDF)  in  Chap.  2,  Sect.  2.2,  but  the  definition  there  was  empirical.  Now, 
we  develop  it  theoretically. 
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7.2  Cumulative  Distribution  Function  (CDF) 

Definition  7.2. 1  The  cumulative  distribution  function  (CDF)  of  a  random  variable 
X  is  defined  as 

F(x)  =  P(X  <  x).  (7.2) 

As  in  Chap.  2,  we  can  see  that  the  CDF  is  useful  in  obtaining  the  probabilities  related 
to  the  occurrence  of  random  events.  Note  that  the  empirical  cumulative  distribu¬ 
tion  function  (ECDF,  Sect.  2.2)  and  the  cumulative  distribution  function  are  closely 
related  and  therefore  have  a  similar  definition  and  similar  calculation  rules.  How¬ 
ever,  in  Chap.  2,  we  work  with  the  cumulative  distribution  of  observed  values  in  a 
particular  sample  whereas  in  this  chapter,  we  deal  with  random  variables  modelling 
the  distribution  of  a  general  population. 

The  Definition  7.2. 1  implies  the  following  properties  of  the  cumulative  distribution 
function: 

•  F(x)  is  a  monotonically  non-decreasing  function 
(if  x\  <  X2,  it  follows  that  F(x i)  <  F(X2)X 

•  lirn  ^-oo  F(x)  =  0  (the  lower  limit  of  F  is  0), 

•  lirn  ^+oo  F(x)  =  1  (the  upper  limit  of  F  is  1), 

•  Fix)  is  continuous  from  the  right,  and 

•  0  <  F(x)  <  1  for  all  x  e  1Z. 

Another  notation  for  Fix)  =  P(X  <  x)  is  Fx(x),  but  we  use  Fix). 


7.2.1  CDF  of  Continuous  Random  Variables 

Before  giving  some  examples  about  the  meaning  and  interpretation  of  the  CDF,  we 
first  need  to  consider  some  definitions  and  theorems. 

Definition  7.2.2  A  random  variable  X  is  said  to  be  continuous  if  there  is  a  function 
f{x)  such  that  for  all  v  e  FZ 

/X 

fit)  At  (7.3) 

-00 

holds.  F{x)  is  the  cumulative  distribution  function  (CDF)  of  X,  and  fix)  is  the 
probability  density  function  (PDF)  of  x  and  j^F(x)  =  fix)  for  all  v  that  are 
continuity  points  of  /. 

Theorem  7.2.1  For  a  function  fix)  to  be  a  probability  density  function  (PDF)  of 

X,  it  needs  to  satisfy  the  following  conditions: 


(1)  f(x)  >  0  for  all  x  e  7 Z, 

(2)  JZc  f(Odx  =  1. 
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Theorem  7.2.2  Let  X  be  a  random  variable  with  CDF  F(x).  If  x\  <  X2,  where  x\ 
and  X2  are  known  constants,  P(x\  <  X  <x2)  =  F{x  2)  -  Fix  i)  =  f*2  fix)  Ax. 

Theorem  7.2.3  77z£  probability  of  a  continuous  random  variable  taking  a  particular 
value  xo  is  zero : 

P(X=x0)  =  0.  (7.4) 

The  proof  is  provided  in  Appendix  C.2. 


Example  7.2. 1  Consider  the  continuous  random  variable  “waiting  time  for  the  train”. 
Suppose  that  a  train  arrives  every  20  min.  Therefore,  the  waiting  time  of  a  particular 
person  is  random  and  can  be  any  time  contained  in  the  interval  [0,  20].  We  can  start 
describing  the  required  probability  density  function  as 

k  for  0  <  v  <  20 
J  X  0  otherwise 


where  k  is  an  unknown  constant.  Now,  using  condition  (2)  of  Theorem  7.2.1,  we 
have 

r20 


1  =  /  fix)  dx  =  [kx]l°  =  20  k 


JO 


which  needs  to  be  fulfilled.  This  yields  k  =  1  /20  which  is  always  greater  than  0, 
and  therefore,  condition  (1)  of  Theorem  7.2.1  is  also  fulfilled.  It  follows  that 


fix)  = 


20 

0 


for  0  <  v  <  20 
otherwise 


is  the  probability  density  function  describing  the  waiting  time  for  the  train.  We  can 
now  use  Definition  7.2.2  to  determine  the  cumulative  distribution  function: 


F(x)  = 


f(t)dt  = 


1 

20 


Mo  = 


1 

20 


v. 


Suppose  we  are  interested  in  calculating  the  probability  of  a  waiting  time  between 
15  and  20  min.  This  can  be  calculated  using  Theorem  7.2.2: 


20  15 

P(15  <  X  <  20)  =  F(20)  -  F(15)  = - =  0.25. 

20  20 

We  can  obtain  this  probability  from  the  graph  of  the  CDF  as  well,  see  Fig.  7.1  where 
both  the  PDF  and  CDF  of  this  example  are  illustrated. 


Defining  a  function,  for  example  the  CDF,  is  simple  in  R:  One  can  use  the 
function  command  followed  by  specifying  the  variables  the  function  evaluates 
in  round  brackets  (e.g.  x)  and  the  function  itself  in  braces  (e.g.  v /20).  Functions  can 
be  plotted  using  the  curve  command: 

cdf  <-  function(x) {1/20  *  x} 
curve (cdf , f rom=0 , to=20) 
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(a)  PDF 


(b)  CDF 


Fig.  7.1  Probability  density  function  (PDF)  and  cumulative  distribution  function  (CDF)  for  waiting 
time  in  Example  7.2.1 


Alternatively,  the  plot  command  can  be  used  to  plot  vectors  against  each  other;  for 
example,  after  defining  a  function,  we  can  define  a  sequence 
(x<-seq(0,20,0 .01)),  evaluate  this  sequence  via  the  specified  function 
(cdf  (x) ),  and  plot  them  against  each  other  and  connect  the  points  from  the  sequence 
with  a  line  (plot  (x ,  cdf  (x) ,  type=  *  1 J ) ). 

This  example  illustrates  how  the  cumulative  distribution  function  can  be  used 
to  obtain  probabilities  of  interest.  Most  importantly,  if  we  want  to  calculate  the 
probability  that  the  random  variable  X  takes  values  in  the  interval  [x\ ,  *2],  we  simply 
have  to  look  at  the  difference  of  the  respective  CDF  values  at  x\  and  X2.  Figure  7.2a 
highlights  that  the  interval  probability  corresponds  to  the  difference  of  the  CDF 
values  on  the  y-axis. 

We  can  also  use  the  probability  density  function  to  visualize  P(x  \  <  X  <  X2). 
We  know  from  Theorem  7.2.1  that  f ^  f(x)dx  =  1,  and  therefore,  the  area  under 
the  PDF  equals  1.  Thus,  we  can  interpret  interval  probabilities  as  the  area  under  the 
PDF  between  x\  and  X2.  This  is  presented  in  Fig.  7.2b. 


7.2.2  CDF  of  Discrete  Random  Variables 

Definition  7.2.3  A  random  variable  X  is  defined  to  be  discrete  if  its  probability 
space  is  either  finite  or  countable,  i.e.  if  it  takes  only  a  finite  or  countable  number  of 
values.  Note  that  a  set  V  is  said  to  be  countable,  if  its  elements  can  be  listed,  i.e. 
there  is  a  one-to-one  correspondence  between  V  and  the  positive  integers. 

Example  7.2.2  Consider  the  example  of  tossing  of  a  coin  where  each  trial  results 
in  either  a  head  ( H )  or  a  tail  (T),  each  occurring  with  the  same  probability 
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Fig.  7.2  Graphical  representation  of  the  probability  P(x\  <  X  <  X2)  a  via  the  CDF  and  b  via  the 
PDF* 

0.5.  When  the  coin  is  tossed  multiple  times,  we  may  observe  sequences  such  as 

H,  T,  H,  H,  T,  H,  H,T,  and  T , _ The  sample  space  is  £2  =  {H,  T}.  Let  the 

random  variable  X  denote  the  number  of  trials  required  to  get  the  third  head,  then 
X  =  4  for  the  given  sequence  above.  Clearly,  the  space  of  X  is  the  set  (3,  4,  5, . . .). 
We  can  see  that  X  is  a  discrete  random  variable  because  its  space  is  finite  and  can 
be  counted.  We  can  also  assign  certain  probabilities  to  each  of  these  values,  e.g. 
P(X  =  3)  =  pi  and  P(X  =  4)  =  p2. 


Definition  7.2.4  Let  X  be  a  discrete  random  variable  which  takes  k  different  values. 
The  probability  mass  function  (PMF)  of  X  is  given  by 

f(X)  =  P{X  =  Xi)  =  pi  for  each  i  =  1,2 , ,k. 

It  is  required  that  the  probabilities  pi  satisfy  the  following  conditions: 

(1)  0  <Pi<  1, 

(2)  zti  Pi  =  i. 


(7.5) 


7.2  Cumulative  Distribution  Function  (CDF) 
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Definition  7.2.5  Given  (7.5),  we  can  write  the  CDF  of  a  discrete  random  variable  as 

k 

F(x)  =  ^  I{Xi<x)Pi,  (7-6) 

i=l 

where  I  is  an  indicator  function  defined  as 

_  1  if  Xi  <  x 

q  otherwise. 

The  CDF  of  a  discrete  variable  is  always  a  step  function. 

Working  with  the  CDF  for  Discrete  Random  variables 

We  can  easily  calculate  various  types  of  probabilities  for  discrete  random  variables 


using  the  CDF.  Let  a  and  b  be  some  known  constants,  then 

P(X  <  a)  =  F{a ),  (7.7) 

P(X  <  a)  =  P(X  <a)~  P(X  =  a)  =  F(a)  -  P(X  =  a),  (7.8) 

P(X  >  a)  =  1  -  P(X  <  a)  =  1  -  Fia),  (7.9) 

P(X  >  a)  =  1  -  P(X  <a)  =  1  -  F(q)  +  P(X  =  a ),  (7.10) 

P(a  <  X  <b)  =  P(X  <b)~  P(X  <  a) 

=  F{b)  -  F(a)  +  P(X  =  a ),  (7.11) 

P(a  <  X  <  b)  =  F{b)  -  Fia),  (7.12) 

Pia  <  X  <  b)  =  Fib)  -  Fia)  -  P(X  =  b),  (7.13) 

P(a  <  X  <  b)  =  Fib)  -  Fia)  -  PiX  =  b)  +  PiX  =  a).  (7.14) 


Remark  7.2.1  The  Eqs.  (7.7)-(7. 14)  can  also  be  used  for  continuous  variables,  but  in 
this  case,  P(X  =  a)  =  PiX  =  b)  =  0  (see  Theorem  7.2.3),  and  therefore,  Eqs.  (7.7)- 
(7.14)  can  be  modified  accordingly. 

Example  7.2.3  Consider  the  experiment  of  rolling  a  die.  There  are  six  possible  out¬ 
comes.  If  we  define  the  random  variable  X  as  the  number  of  dots  observed  on 
the  upper  surface  of  the  die,  then  the  six  possible  outcomes  can  be  described  as 
x\  =  1,  X2  =  2,  . . . ,  X6  =  6.  The  respective  probabilities  are  P(X  =  Xf)  =  1/6;  i  = 
1,2, ...  ,6.  The  PMF  and  CDF  are  therefore  defined  as  follows: 


fix)  = 


1/6 

if 

x  =  l 

0 

if 

—  00 

< 

V 

< 

1 

1/6 

if 

x  =  2 

1/6 

if 

1 

< 

V 

< 

2 

1/6 

if 

x  =  3 

2/6 

if 

2 

< 

V 

< 

3 

1/6 

if 

x  =  4  Fix)  =  < 

3/6 

if 

3 

< 

V 

< 

4 

1/6 

if 

x  =  5 

4/6 

if 

4 

< 

V 

< 

5 

1/6 

if 

x  =  6 

5/6 

if 

5 

< 

V 

< 

6 

0 

elsewhere. 

1 

if 

6 

< 

V 

< 

00. 
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(a)  PDF 


(b)  CDF 


Fig.  7.3  Probability  density  function  and  cumulative  distribution  function  for  rolling  a  die  in 
Example  7.2.3.  relates  to  an  included  value  and  “o”  to  an  excluded  value 


Both  the  CDF  and  the  PDF  are  displayed  in  Fig.  7.3. 

We  can  use  the  CDF  to  calculate  any  desired  probability,  e.g.  P(X  <  5)  = 
F( 5)  =  5/6.  This  is  shown  in  Fig.  7.3b  where  for  X  =  5,  we  obtain  F( 5)  =  5/6 
when  evaluating  on  the  y-axis.  Similarly,  P( 3  <  X  <  5)  =  F( 5)  -  F (3)  = 
(5/6)  —  (3/6)  =  2/6  can  be  interpreted  as  the  difference  of  F( 5)  and  F( 3)  on  the 
y-axis. 


7.3  Expectation  and  Variance  of  a  Random  Variable 

We  have  seen  that  both  the  probability  density  function  (or  probability  mass  function) 
and  the  cumulative  distribution  function  are  helpful  in  characterizing  the  features  of 
a  random  variable.  Some  other  features  of  random  variables  are  characterized  by  the 
concepts  of  expectation  and  variance. 


7.3.1  Expectation 

Definition  7.3.1  The  expectation  of  a  continuous  random  variable  X,  having  the 
probability  density  function  f(x)  with  f  \x\f(x)dx  <  oo,  is  defined  as 

/»+o o 


E(X)  = 


xf(x)dx. 


(7.15) 
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For  a  discrete  random  variable  X,  which  takes  the  values  x\ ,  X2 , . . .  with  respective 
probabilities  pi,  pi, . . the  expectation  of  X  is  defined  as 

k 

E(X)  =  =  x\ P(X  =  x\)  +X2 P(X  =  X2)  H - h  XkP(X  =  Xk).  (7.16) 

i= 1 

The  expectation  ofX,i.e.E(X),is  usually  denoted  by  p  =  E(X)  and  relates  to  the 
arithmetic  mean  of  the  distribution  of  the  population.  It  reflects  the  central  tendency 
of  the  population. 


Example  7.3. 1  Consider  again  Example  7.2.1  where  the  waiting  time  for  a  train  was 
described  by  the  following  probability  density  function: 


fix)  = 


20 

0 


for  0  <  x  <  20 
otherwise. 


We  can  calculate  the  expectation  as  follows: 


■00  po 

E(X)  =  /  xf(x)dx  =  /  xf(x)  dx + 

-00  7—00  VO 


20  roo 

xf(x)  dx  +  /  xf(x)dx 


20 


*20 


=  0  + 


1 

20 


v  dx  +  0  = 


1 

40 


■x‘ 


-\  20 


Jo 


400 

^0" 


-0  =  10. 


The  “average”  waiting  time  for  the  train  is  therefore  10  min.  This  means  that  if  a 
person  has  to  wait  for  the  train  every  day,  then  the  person  will  experience  waiting 
times  varying  randomly  between  0  and  20  min  and,  on  average,  has  to  wait  for  10  min. 


Example  7.3.2  Consider  again  the  die  roll  experiment  from  Example  7.2.3.  The 
probabilities  for  the  occurrence  of  any  v; ,  i  =  1,  2,  . . . ,  6,  are  P(X  =  x;)  =  1/6. 
The  expectation  can  thus  be  calculated  as 

6 

e(x)  =  y,  Xj  pi 

/—i 

=  1  •  P(X  =  1)  +  2  •  P(X  =  2)  +  3  •  P(X  =  3)  +  4  •  P(X  =  4) 

+  5  •  P(X  =  5)  +  6  •  P(X  =  6) 

—  (1+2  +  3  +  4  +  5  +  6)  —  =  — -  =  3.5. 

6  6 


7.3.2  Variance 

The  variance  describes  the  variability  of  a  random  variable.  It  gives  an  idea  about  the 
concentration  or  dispersion  of  values  around  the  arithmetic  mean  of  the  distribution. 
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Definition  7.3.2  The  variance  of  a  random  variable  X  is  defined  as 

Var(X)  =  E[X  -  E(X)]2. 

The  variance  of  a  continuous  random  variable  X  is 


Var(X)  = 


E(X))2f(x)dx 


(7.17) 

(7.18) 


where  E(X)  =  f_°°  x f(x)dx.  Similarly,  the  variance  of  a  discrete  random  variable 
X  is 


Var(X)  =  -  E(X)fPi  (7.19) 

1=1 


where  E(X)  =  Xi pt .  The  variance  is  usually  denoted  by  a2  =  Var(X). 


Definition  7.3.3  The  positive  square  root  of  the  variance  is  called  the  standard 
deviation. 


Example  7.3.3  Recall  Examples  7.2.1  and  7.3.1.  We  can  calculate  the  variance  of 
the  waiting  time  for  a  train  using  the  probability  density  function 


fix)  = 


20 

0 


for  0  <  v  <  20 
otherwise 


and  E(X) 
Var(X)  = 


=  10  (already  calculated  in  Example  7.3.1).  Using  (7.18),  we  obtain: 

poo 

(. x  —  E (x))2  f(x)  dx  =  (x  —  10 )2  f(x)  dx 


•oo 


■CO  J  —oo 

*0  r20 

,2  /•/  x  i  .  /  /  im  2 


•oo 


(x  —  10 )  fix)  dx  +  /  (x  —  10 )  fix)  dx  +  /  (x  —  10 )  fix)  dx 
oo  JO  J  20 

20  ^  p20  ^ 

0+1  (x  -  10)2  •  —  dx  +  0  =  /  —  (x2  -  20x  +  100) dx 


20  \3 


20 


-v3  —  10v2  +  lOOv 


20 


)] 


20 


0 


1 

=  33-. 
3 


The  standard  deviation  is  ^33^  min2  ~  5.77  min. 


Recall  that  in  Chap.  3,  we  introduced  the  sample  variance  and  the  sample  stan¬ 
dard  deviation.  We  already  know  that  the  standard  deviation  has  the  same  unit  of 
measurement  as  the  variable,  whereas  the  unit  of  the  variance  is  the  square  of  the 
measurement  unit.  The  standard  deviation  measures  how  the  values  of  a  random  vari¬ 
able  are  dispersed  around  the  population  mean.  A  low  value  of  the  standard  deviation 
indicates  that  the  values  are  highly  concentrated  around  the  mean.  A  high  value  of 
the  standard  deviation  indicates  lower  concentration  of  the  data  values  around  the 
mean,  and  the  observed  values  may  be  far  away  from  the  mean.  These  considerations 
are  helpful  in  making  connections  between  random  variables  and  samples  of  data, 
see  Chap.  9  for  the  construction  of  confidence  intervals. 


7.3  Expectation  and  Variance  of  a  Random  Variable 
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Example  7.3.4  Recall  Example  7.3.2  where  we  calculated  the  expectation  of  a  die 
roll  experiment  as  E(X)  =  3.5.  With  jc*  e  {1,  2,  3,  4,  5,  6}  and  pi  =  1/6  for  all  i  = 
1,  2,  3,  4,  5,  6,  the  variance  for  this  example  corresponds  to 

Var(X)  =  -  E (X)fPi  =  (1  -  3.5)2  •  1  +  (2  -  3.5)2  •  1  +  (3  -  3.5)2  •  1 

/  —  I 

+(4  -  3.5)2  •  -  +  (5  -  3.5)2  •  -  +  (6  -  3.5)2  •  -  %  2.92. 

6  6  6 

Theorem  7.3.1  The  variance  of  a  random  variable  X  can  be  expressed  as 

Var(X)  =  E(X2)  -  [E(X)]2.  (7.20) 

The  proof  is  given  in  Appendix  C.2. 


Example  7.3.5  In  Examples  7.2.1,  7.3.1,  and  7.3.3,  we  evaluated  the  waiting  time 
for  a  train  using  the  PDF 


f(X)  = 


_j_ 

20 

0 


for  0  <  X  <  20 
otherwise. 


We  calculated  the  expectation  and  variance  in  Eqs.  (7.15)  and  (7.17)  as  10  min  and 
33^  min2,  respectively.  Theorem  7.3.1  tells  us  that  we  can  calculate  the  variance  in 
a  different  way  as  follows: 


■oo 


*20 


E(XZ)  =  /  xzf(x)dx  = 


■CO 


1 

20 


x2  dx 


1 

60 


■X' 


-\  20 


Jo 


1 

=  133- 
3 


1 


Var(X)  =  E(X2)  -  [E(X)]2  =  133-  -  102  = 
This  yields  the  same  result  as  Eq.  (7.18)  but  is  much  quicker. 


1 

33- 

3 


7.3.3  Quantiles  of  a  Distribution 

We  introduced  the  concept  of  quantiles  in  Chap.  3,  Sect.  3.1.2.  Now,  we  define 
quantiles  in  terms  of  the  distribution  function. 

Definition  7.3.4  The  value  xp  for  which  the  cumulative  distribution  function  is 

F(xp)  =  p  (0  <  p  <  1)  (7.21) 


is  called  the  p-quantile. 
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Fig.  7.4  First  quartile, 
median,  and  third  quartile* 


It  follows  from  Definition  7.3.4  that  xp  is  the  value  which  divides  the  cumulative 
distribution  function  into  two  parts:  the  probability  of  observing  a  value  left  of  xp  is 
p ,  whereas  the  probability  of  observing  a  value  right  of  xp  is  1  —  p.  For  example,  the 
0.25-quantile  vo.25  describes  the  v-value  for  which  the  probability  of  observing  vo.25 
or  any  smaller  value  is  0.25.  Figure  7.4  shows  the  0.25-quantile  (first  quartile),  the  0.5- 
quantile  (median),  and  the  0.75-quantile  (third  quartile)  in  a  cumulative  distribution 
function. 


Example  7.3.6  Recall  Examples  7.2.1, 7.3.1, 7.3.5  and  Fig.  7.1b  where  we  described 
the  waiting  time  for  a  train  by  using  the  following  CDF: 


F(x)  = 


1 

20 


v. 


The  first  quartile  vo.25  is  5  because  F( 5)  =  5/20  =  0.25.  This  means  that  the 
probability  of  waiting  for  the  train  for  5  min  or  less  is  25  %  and  of  waiting  for  longer 
than  5  min  is  75  %. 


For  continuous  variables,  there  is  a  unique  value  which  describes  the  p -quantile. 
However,  for  discrete  variables,  this  may  not  necessarily  be  true.  In  this  case,  the 
p-quantile  is  chosen  such  that 

F(xp )  >  p, 

Fix)  <  p  for  x  <  xp 

holds. 


Example  7.3. 7  The  cumulative  distribution  function  for  rolling  a  die  is  described  in 
Example  7.2.3  and  Fig.  7.3b.  The  first  quartile  vo.25  is  2  because  F( 2)  =  2/6  >  0.25 
and  F(x)  <  0.25  forv  <  2. 


7.3.4  Standardization 

Standardization  transforms  a  random  variable  in  such  a  way  that  it  has  an  expectation 
of  zero  and  a  variance  of  one.  More  details  on  the  need  for  standardization  are 
discussed  in  Chap.  10. 


7.4  Tschebyschev's  Inequality 
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Definition  7.3.5  A  random  variable  Y  is  called  standardized  when 

E(Y)  =  0  and  Var(T)  =  1. 


Theorem  7.3.2  Suppose  a  random  variable  X  has  mean  E(X)  =  p  and  Var(X)  = 
a2.  Then,  it  can  be  standardized  as  follows: 


X  -  p  _  X  -  E(X) 
(J  V^ar(X) 


(7.22) 


Example  7.3.8  In  Examples  7. 2.1,  7.3.1,  and  7.3.5,  we  considered  the  waiting  time 
X  for  a  train.  The  random  variable  X  can  take  values  between  0  and  20  min,  and  we 
calculated  E(X)  =  10  and  Var(X)  =  33^.  The  standardized  variable  of  X  is 


X- p 
a 


X  -  10 


One  can  show  that  E(T)  =  0  and  Var(T)  =  1,  see  also  Exercise  7.10  for  more 
details. 


7.4  Tschebyschev's  Inequality 

If  we  do  not  know  the  distribution  of  a  random  variable  X ,  we  can  still  make  state¬ 
ments  about  the  probability  that  X  takes  values  in  a  certain  interval  (which  has  to 
be  symmetric  around  the  expectation  p)  if  the  mean  p  and  the  variance  a2  of  X  are 
known. 


Theorem  7.4.1  (Tschebyschev’s  inequality)  Let  X  be  a  random  variable  with 
E(X)  =  p  and  Var(X)  =  a2.  It  holds  that 

Var(X) 


This  is  equivalent  to 


P(\X-p\>c)  < 


P(\X  -  p\  <c)>  1  - 


Var(X) 


(7.23) 


(7.24) 


The  proof  is  given  in  Appendix  C.2. 


Example  7.4.1  In  Examples  7.2.1,  7.3.1,  and  7.3.5,  we  have  worked  with  a  random 
variable  which  describes  the  waiting  time  for  a  train.  We  determined  E(X)  =  10  and 
Var(X)  =  33^.  We  can  calculate  the  probability  of  waiting  between  10  —  7  =  3  and 
10  +  7  =  17  min: 

Var(X) 

P(\X-p\  <  c)  >  1 - ^ 

_33| 
l2 


P(\X  -  10|  <  7)  >  1 


0.32. 
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The  probability  is  therefore  at  least  0.32.  However,  if  we  apply  our  distributional 
knowledge  that  F(x )  =  (for  0  <  X  <  20),  then  we  obtain  a  much  more  precise 
result  which  is 

17  3 

P(  3  <X  <  17)  =  F(17)  -  F(3)  = - =  0.7. 

20  20 

We  can  clearly  see  that  Tschebyschev’s  inequality  gives  us  the  correct  answer,  that 
is  P(3  <  X  <  17)  is  greater  0.32.  Nevertheless,  the  approximation  to  the  exact 
probability,  0.7,  is  rather  poor.  One  needs  to  keep  in  mind  that  only  the  lack  of 
distributional  knowledge  makes  the  inequality  useful. 


7.5  Bivariate  Random  Variables 

There  are  many  situations  in  which  we  are  interested  in  analysing  more  than  one 
variable,  say  two  variables.  When  we  have  more  than  one  variable,  then  not  only 
their  individual  distributions  but  also  their  joint  distribution  can  be  of  interest.  For 
example,  we  know  that  driving  a  car  after  drinking  alcohol  is  not  necessarily  safe.  If 
we  consider  two  variables,  the  blood  alcohol  content  X  and  number  of  car  accidents 
Y,  then  we  may  be  interested  in  the  probability  of  having  a  high  blood  alcohol 
content  and  a  car  accident  at  the  same  time.  If  we  analyse  ( X ,  Y)  jointly,  then  we  are 
interested  in  their  joint  bivariate  distribution  /xy(x,  y).  This  distribution  can  either 
be  discrete  or  continuous. 

Discrete  Bivariate  Random  Variables.  Suppose  we  have  two  categorical  variables 
X  and  Y  which  can  take  the  values  x\,  X2,  . . . ,  x\  and  yi,  y2,  •  •  • ,  yj,  respectively. 

Their  joint  probability  distribution  function  is  characterized  by 

P(X  =  Xi,  Y  =  yj)  =  pij  (i  =  1,  2,  ...,/;  j  =  1,2,...,/) 

with  Z/=i  Zf=i  Pij  =  1 .  This  means  that  the  probability  of  observing  xi  and  yj 
together  is  pij .  We  can  summarize  this  information  in  a  contingency  table  as  follows: 


Y 

1  2  ...  / 

Total 

1 

pn  pn  •••  pij 

P\+ 

2 

P21  P22  •••  P2J 

P2+ 

... 

X 

pn  P12  •  •  •  pij 

P1+ 

Total 

P+ 1  P+2  ■  •  •  P+J 

1 

Each  cell  contains  a  “piece”  of  the  joint  distribution.  The  entries  p+\,  p+ 2,  . . . , 
p+ j  in  the  bottom  row  of  the  table  summarize  the  marginal  distribution  of  Y,  which 
is  the  distribution  of  Y  without  giving  reference  to  X.  The  entries  pi+,  p+2,  • . . ,  pi+ 
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in  the  last  column  summarize  the  marginal  distribution  of  X.  The  marginal  distribu¬ 
tions  can  therefore  be  expressed  as 


j 

P(X  =  Xi )  =  ^  Pij  =  Pi+  i  =  1,2,  . . . ,  I, 

j= 1 
/ 

P(Y  =  yj )  =  y'.Pij  =  P+J  7  =  1,2,...,/. 

/  —  I 


The  conditional  distributions  of  X  given  Y  =  yj  and  Y  given  X  =  xj  are  given 
as  follows: 


P(X  =  Xi\Y  =  y/)  =  Pi\j 
P(Y  =  yj\X  =  Xi)  =  Pj\i 


Pij 

P+j 

Pij 

Pi+ 


i  =  1,2, 
j  1,2,...,*/. 


They  summarize  the  distribution  of  X  for  a  given  value  of  yy  (or  the  distribution  of  Y 
for  a  given  value  of  Xi )  and  play  a  crucial  role  in  the  construction  of  regression  models 
such  as  the  linear  regression  model  introduced  in  Chap.  11.  Please  also  recall  the 
definitions  of  Sect.  4.1  where  we  introduced  conditional  and  marginal  distributions 
for  data  samples  rather  than  random  variables. 


Example  7.5.1  Suppose  we  have  a  contingency  table  on  smoking  behaviour  X 
(1  =  never  smoking,  2  =  smoking  sometimes,  and  3  =  smoking  regularly) 
and  education  level  Y  (1  =  primary  education,  2  =  Secondary  education,  and 
3  =  tertiary  education): 


Y 

1  2  3 

Total 

X  1 

0.10  0.20  0.30 

0.60 

2 

0.10  0.10  0.10 

0.30 

3 

0.08  0.01  0.01 

0.10 

Total 

0.28  0.31  0.41 

1 

The  cell  entries  represent  the  joint  distribution  of  smoking  behaviour  and  education 
level.  We  can  interpret  each  entry  as  the  probability  of  observing  X  =  Xi  and 
Y  =  yj  simultaneously.  For  example,  P23  =  P  (“smoking  sometimes  and  tertiary 
education”)  =  0.10.  The  marginal  distribution  of  X  is  contained  in  the  last  column 
of  the  table  and  lists  the  probabilities  of  smoking  (unconditional  on  education  level), 
e.g.  the  probability  of  being  a  non-smoker  in  this  population  is  60  %.  We  can  also 
interpret  the  conditional  distributions:  P(X\Y  =  3)  represents  the  distribution  of 
smoking  behaviour  among  those  who  have  tertiary  education.  If  we  are  interested 
in  the  probability  of  smoking  sometimes  given  tertiary  education  is  completed,  then 
we  calculate  P(X  =  2\Y  =  3)  =  p2\3  =  =  0.24. 
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V2 


y  i 


X\ 


x2 


Fig.  7.5  Area  covering  all  points  of  ( X ,  Y)  with  (x\  <  X  <  x2,  yi  <  Y  <  y2)* 


Continuous  Bivariate  Random  Variables. 

Definition  7.5.1  A  bivariate  random  variable  ( X ,  Y)  is  continuous  if  there  is  a  func¬ 
tion  /xy(x,  y)  such  that 

FXY(x,y)  =  P(X  <  x,Y  <  y)  =  (  f  fxY(x,y)dxdy  (7.25) 

7—00  7—00 

holds. 

The  function  Fxy(x,  y)  is  the  joint  cumulative  distribution  function  of  X  and 

Y ;  the  joint  distribution  function  is  denoted  by  /xy(x,  y),  and  /xy(x,  y)  has  to  fulfil 
the  usual  conditions  of  a  density  function.  Necessary  and  sufficient  conditions  that 
a  function  Fxy(x ,  y)  is  a  bivariate  cumulative  distribution  function  are  as  follows: 


lim  FXy(x,y)  =  0 

oo 

lim  FXy(x,  y)  =  1 

X^OO 


lim  FXv(x,y)=  0 

y^—oo 

lim  FXy(x,  y)  =  1 

y^oo 


and  F(x 2,  yi)  ~  F(x  1, y2)  -  ^(*2,  yi)  +  F(*i,  yi)  >  0  for  all  x\  <  *2,  yi  <  y2. 


The  last  condition  is  sometimes  referred  to  as  the  rectangle  inequality.  As  in 
the  univariate  case,  we  can  use  the  cumulative  distribution  function  to  calcu¬ 
late  interval  probabilities;  similarly,  we  look  at  the  rectangular  area  defined  by 
(x\,  yi),  (jci,  y2),  (X2,  y  1),  and  (x2,  y2)  in  the  bivariate  case  (instead  of  an  interval 
[a,  b]),  see  Fig.  7.5. 

We  can  calculate  the  desired  probabilities  as  follows: 

ryi  rx  2 

P(x  1  <  X  <  x2,  yi  <  Y  <  y2)  =  /  /xy(x,  y)  dx  dy. 

J  yi  Jx\ 

The  marginal  distributions  of  X  and  Y  are 

rCO  roc 

fx(x)  =  /  fxY(x,y)  dy,  fY(y)  =  /  fxY(x,y)  dx, 
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(a)  Joint  distribution  fxy(x,y) 


(b)  Marginal  distribution  fx(x) 


Fig.  7.6  Joint  and  marginal  distribution  for  Example  7.5.2 


respectively.  Similar  to  the  discrete  case,  fx(x)  and  fy(y)  also  describe  the  distri¬ 
bution  of  X  unconditional  on  Y  and  the  distribution  of  Y  unconditional  on  X.  The 

cumulative  marginal  distributions  are 

Fx(x )  =  f  fx(t)dt,  Fy(y)  =  f  fy(t)dt. 

J—oo  7—00 


The  conditional  distributions  can  be  obtained  by  the  ratio  of  the  joint  and  marginal 
distributions: 


fx\r(x,y) 


fix,  y) 

fiy )  ’ 


fv\x(x,y) 


fjx,y) 
fix ) 


Example  7.5.2  Consider  the  function 


/xy(x,  y) 


x  +  y  for  0  <  v  <  1 ,  0  <  y  <  1 

0  elsewhere. 


Suppose  X  and  Y  represent  the  concentrations  of  two  drugs  in  the  human  body. 
Then,  fxy(x,y)  may  represent  the  sum  of  two  drug  concentrations  in  the  human 
body.  Since  there  are  infinite  possible  realizations  of  both  X  and  Y ,  we  represent 
their  joint  distribution  in  a  figure  rather  than  a  table,  see  Fig.  7.6a. 

The  marginal  distributions  for  X  and  Y  can  be  calculated  as  follows: 


fx(x)  = 


•CO 

■CO 

•CO 


fxr(x,y)dy=  /  ( x  +  y)dy  = 


fr  (x)  =  /  fxYix,  y) dx  =  (x  +  y)  dx  = 


—  CO 


1  - 

*.y  +  x  y 

1  2 

-X  +xy 


-il 

.  o 
-|1 

-  0 


1 

~X+2’ 

1 

-y+2' 


Figure  7.6b  depicts  the  marginal  distribution  for  X.  The  slope  of  the  marginal  dis¬ 
tribution  is  essentially  the  slope  of  the  surface  of  the  joint  distribution  shown  in 
Fig.  7.6a.  It  is  easy  to  see  in  this  simple  example  that  the  marginal  distribution  of 
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X  is  nothing  but  a  cut  in  the  surface  of  the  joint  distribution.  Note  that  the  condi¬ 
tional  distributions  fx\ f(x,  y)  and  fy\x(x,  y)  can  be  easily  calculated;  for  example, 
fx\r(x,  y )  =  f{x,  y)/f(y)  =  (x  +  y)/(y  +  0.5). 

Stochastic  Independence. 

Definition  7.5.2  Two  continuous  random  variables  X  and  Y  are  said  to  be  stochas¬ 
tically  independent  if 

fxY(x,y)  =  fx(x)fY(y).  (7.26) 

For  discrete  variables,  this  is  equivalent  to 

P(X  =  xt ,  Y  =  yj)  =  P(X  =  Xi)P(Y  =  yj)  (7.27) 

being  valid  for  all  (/,  j). 


Example  7.5.3  In  Example  7.5.2,  we  considered  the  function 

x  -\-  y  for  0  <  x  <  1 ,  0  <  y  <  1 

fxY  x  ■>  y  q  elsewhere 


with  the  marginal  distributions  of  X  and  Y  as  fx  =  x  +  0.5  and  fy  =  y  +  0.5, 
respectively.  Since  fx  -  Iy  =  (x  +  \)(y  +  \)  7^  fxY,  it  follows  that  X  and  Y  are  not 
independent.  The  interpretation  is  that  the  concentrations  of  the  two  drugs  are  not 
independent. 


7.6  Calculation  Rules  for  Expectation  and  Variance 

Calculation  Rules  for  the  Expectation.  For  any  constant  values  a  and  b ,  and  any 


random  variables  X  and  Y,  the  following  rules  hold: 

E(a)  =  a ,  (7.28) 

E(bX)  =  bE(X),  (7.29) 

E(a  +  bX)=a  +  bE(X ),  (7.30) 

E(X  +  Y)  =  E(X)  +  E(7)  (additivity).  (7.31) 

The  proof  of  rule  (7.30)  is  given  in  Appendix  C.2. 


Example  7.6. 1  Consider  again  Example  7.2.3  where  we  illustrated  how  the  outcome 
of  a  die  roll  experiment  can  be  captured  by  a  random  variable.  There  were  6  events, 
and  X  could  take  the  values  x\  =  1,  X2  =  2, ...  ,xe  =  6.  The  probability  of  the 
occurrence  of  any  number  was  P  (X  =  Xi )  =  1  /6,  and  the  expectation  was  calculated 
as  3.5.  Consider  two  different  situations: 


7.6  Calculation  Rules  for  Expectation  and  Variance 
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(i)  Suppose  the  die  takes  the  value  10,  20,  30,  40,  50,  and  60  instead  of  the  values 
1,  2,  3,  4,  5,  and  6.  The  random  variable  Y  =  10X  describes  this  suitably,  and 
its  expectation  is 

E(F)  =  E(10X)  =  10E(X)  =  10  •  3.5  =  35 
which  follows  from  (7.29). 

(ii)  If  we  are  rolling  two  dices  X\  and  X2,  then  the  expectation  for  the  sum  of  the 
two  outcomes  is 

E(X)  =  E(Xi  +  X2)  =  E(Xi)  +  E(X2)  =  3.5  +  3.5  =  7 
due  to  (7.31). 


Calculation  Rules  for  the  Variance.  Let  a  and  b  be  any  known  constants  and  X  be 
a  random  variable  (discrete  or  continuous).  Then,  we  have  the  following  rules: 

Var  (a)  =  0,  (7.32) 

Var  (bX)  =  b2  Var(X),  (7.33) 

Var  (a  +  bX )  =  b 2  Var(X).  (7.34) 

The  proof  of  rule  (7.34)  is  given  in  Appendix  C.2. 


Example  7.6.2  In  Examples  7.2.1,  7.3.1,  7.3.3,  and  7.3.5,  we  evaluated  a  random 
variable  describing  the  waiting  time  for  a  train.  Now,  suppose  that  a  person  first  has 
to  catch  a  bus  to  get  to  the  train  station.  If  this  bus  arrives  only  every  60  min,  then 
the  PDF  of  the  random  variable  Y  denoting  the  waiting  time  for  the  bus  is 

X  for  0  <  v  <  60 
f(Y)  =  60  — 

0  otherwise . 

We  can  use  Eqs.  (7.15)  and  (7.17)  to  determine  both  the  expectation  and  variance 
of  Y.  However,  the  waiting  time  for  the  bus  is  governed  by  the  relation  Y  =  3X 
where  X  is  the  waiting  time  for  the  train.  Therefore,  we  can  calculate  E(F)  = 
E(3X)  =  3E(X)  =  3-10  =  30  min  by  using  rule  (7.29)  and  the  variance  as 
Var(F)  =  Var(3X)  =  32  Var(X)  =  9-33^  =  300  using  rule  (7.33).  The  total 
waiting  time  is  the  sum  of  the  two  waiting  times. 


7.6.1  Expectation  and  Variance  of  the  Arithmetic  Mean 

Definition  7.6.1  We  define  the  random  variables  X\,  X2, . . . ,  Xn  to  be  i.i.d.  (inde¬ 
pendently  identically  distributed),  if  all  X;  follow  the  same  distribution  and  are 
stochastically  independent  of  each  other. 

Let  X\,  X2,  . . . ,  Xn  be  n  i.i.d.  random  variables  with  E(X/)  =  p  and  Var(X/)  = 
a2,  i  =  1,2 ,  ...  ,n.  The  arithmetic  mean  of  these  variables  is  given  by 
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which  is  again  a  random  variable  that  follows  a  distribution  with  certain  expectation 
and  variance.  A  function  of  random  variables  is  called  a  statistic.  By  using  (7.29) 
and  (7.31),  we  obtain 


E(X)  = 


1  " 

-Ye  (Xi)  =  fi. 

n  f-* 

1  =  1 


(7.35) 


If  we  apply  (7.34)  and  recall  that  the  variables  are  independent  of  each  other,  we  can 
also  calculate  the  variance  as 


Var(X)  = 


j^Vard,) 

i  = 1 


(7.36) 


Example  7.6.3  If  we  toss  a  coin,  we  obtain  either  head  or  tail,  and  therefore, 
/^(“head”)  =  P(“ tail”)  =  \.  If  we  toss  the  coin  n  times,  we  have  for  each  toss 


0  for  “tail” 

1  for  “head”  ’ 


It  is  straightforward  to  calculate  the  expectation  and  variance  for  each  coin  toss: 


E (Xt)  =  0  •  ^  +  1  •  ^ 
Var (Xt)  =  (0  -  l)2  •  I 


_  1 
=  2’ 

+  (1- 


11  1  1  _  1 
4  '  2  +  4  ’  2  _  4' 


The  arithmetic  mean  X  =  ^  X/=i  %i  describes  the  relative  frequency  of  heads  when 
the  coin  is  tossed  n  times.  We  can  now  apply  (7.35)  and  (7.36)  to  calculate 


and 


1 

E(X)  =  -Vl/2=l/2 

n  “ 

i=i 


1  n  1 

Var(X)  =  -2  X  I 

i=l 


1 

4ft 


With  this  example,  the  interpretation  of  formulae  (7.35)  and  (7.36)  becomes  clearer: 
if  the  probability  of  head  is  0.5  for  a  single  toss,  then  it  is  also  0.5  for  the  mean  of  all 
tosses.  If  we  toss  a  coin  many  times,  then  the  variance  decreases  when  n  increases. 
This  means  that  a  larger  sample  size  yields  a  higher  precision  for  the  calculated 
arithmetic  mean.  This  observation  shows  the  basic  conclusion  of  the  next  chapter: 
the  higher  the  sample  size,  the  more  secure  we  are  of  our  conclusions. 


7.7  Covariance  and  Correlation 

The  variance  measures  the  variability  of  a  variable.  Similarly,  the  covariance  mea¬ 
sures  the  covariation  or  association  between  X  and  Y. 


7.7  Covariance  and  Correlation 
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7.7.1  Covariance 

Definition  7.7.1  The  covariance  between  X  and  Y  is  defined  as 

q  =  Cov(X,  Y)  =  E[(X  -  E(X))(F  -  E(F))].  (7.37) 

The  covariance  is  positive  if,  on  average,  larger  values  of  X  correspond  to  larger 
values  of  Y ;  it  is  negative  if,  on  average,  greater  values  of  X  correspond  to  smaller 
values  of  Y. 

The  probability  density  function  of  any  bivariate  random  variable  ( X ,  Y)  is  char¬ 
acterized  by  the  expectation  and  variance  of  both  X  and  Y , 

E(X)  =  fj,x,  Var(X)  =  a\, 

E  (Y)  =  hy,  Var(y)  =  a\, 

as  well  as  their  covariance.  We  can  summarize  these  features  by  using  the  expectation 
vector 

and  the  covariance  matrix 

C  (X \  =  tCov(X’  X)  CoW  E)\  (a2x  q\ 

V  Y  )  V  Cov(y,  x)  Cov(y,  E)  )  \g  Or/’ 

Important  properties  of  covariance  are 

(i)  Cov(X,  Y)  =  Co v(T,  X), 

(ii)  Cov(X,  X)  =  Var(X), 

(iii)  Co\(aX  +  b,cY  +  d)  =  ac  Cov(X,  Y ), 

(iv)  Cov(X,  Y)  =  E(XT)  —  E(X)E(F)  where  E(Xy)  =  f  f  xyf(x,  y)dxdy  for 
continuous  variables  and  E(XT)  =  X,  £  /  Xi  yj  pij  for  discrete  variables, 

(v)  If  X  and  Y  are  independent,  it  follows  that  E(XF)  =  E(X)E(F)  =  pxHY,  and 
therefore,  Cov(X,  Y)  =  px Hy  —  fixfiy  =  0. 

Theorem  7.7.1  (Additivity  Theorem)  The  variance  of  the  sum  ( subtraction )  of  X 
and  Y  is  given  by 

Var(X  ±Y)  =  Var(X)  +  Var(F)  ±  2Cov(X,  Y). 


If  X  and  Y  are  independent,  it  follows  that  Cov(X,  Y)  =  0  and  therefore  Var(X  ± 
Y)  =  Var(X)  +  Var(F).  We  omit  the  proof  of  this  theorem. 


Example  7.7.1  Recall  Example  7.6.2  where  we  considered  the  waiting  time  F  for 
a  bus  to  the  train  station  and  the  waiting  time  X  for  the  waiting  time  for  a  train. 
Suppose  their  joint  bivariate  probability  density  function  can  be  written  as 


fxr(x,  y ) 


for  0  <  x  <  60,  0  <  y  <  20 

0  elsewhere. 
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To  calculate  the  covariance  between  X  and  Y,  we  need  to  calculate  E(XT): 

*20  Y 


E  (XY)  = 


•oo  ro o  /*60 

xyf(x,y)dxdy=  / 

—  oo  J  —  oo  J  0 

•60  400x 


vy - 

0  1200 


dx  dy 


*60  r  2  n^O 

ou  r  v  yz  n 


0  L1200  2  Jo 


dy  = 


2400 


dy  = 


lr 


2 -|  60 


6  2 


Jo 


3600 


=  300 


We  know  from  Example  7.6.2  that  E(X)  =  10,  E(T)  =  30,  Var(X)  =  33^,  and 
Var(T)  =  300.  The  covariance  is  thus 


Cov(X,  Y)  =  E (XY)  -  E(X)E(T)  =  300  -  30  •  10  =  0. 


This  makes  sense  as  the  waiting  times  for  the  train  and  the  bus  should  be  independent 
of  each  other.  Using  rule  (7.31),  we  conclude  that  the  total  expected  waiting  time  is 

E(X  +  Y)  =  E(X)  +  E(T)  =  10  +  30  =  40  min. 

The  respective  variance  is 

1  1 

Var(X  +  Y)  =  Var(X)  +  Var(T)  -  2Cov(X,  Y)  =  33-  +  300  -  2  *  0  =  333- 
due  to  Theorem  7.7.1. 


7.7.2  Correlation  Coefficient 


Definition  7.7.2  The  correlation  coefficient  of  X  and  Y  is  defined  as 


P(X,  Y) 


Cov(X,  Y) 
VVar(X)  VarfFj’ 


(7.38) 


We  already  know  from  Chap.  4  that  the  correlation  coefficient  is  a  measure  of 
the  degree  of  linear  relationship  between  X  and  Y.  It  can  take  values  between  —  1 
and  1,-1  <  p(X,  Y)  <  1.  However  in  Chap.  4,  we  considered  the  correlation  of 
two  samples,  i.e.  realizations  of  random  variables;  here,  we  describe  the  correlation 
coefficient  of  the  population.  If  p(X,  Y)  =  0,  then  X  and  Y  are  said  to  be  uncorrelated. 
If  there  is  a  perfect  linear  relationship  between  X  and  Y,  then  p  =  1  for  a  positive 
relationship  and  p  =  —  1  for  a  negative  relationship,  see  Appendix  C.2  for  the  proof. 


Theorem  7.7.2  If  X  and  Y  are  independent,  they  are  also  uncorrelated.  However, 
if  they  are  uncorrelated  then  they  are  not  necessarily  independent. 


Example  7.7.2  In  Example  7.6.2,  we  estimated  the  covariance  between  the  waiting 
time  for  the  bus  and  the  waiting  time  for  the  train:  Cov(X,  Y)  =  0.  The  correlation 
coefficient  is  therefore  also  0  indicating  no  linear  relationship  between  the  waiting 
times  for  bus  and  train. 


7.9  Exercises 
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7.8  Key  Points  and  Further  Issues 


Note: 


/  Note  that  there  is  a  difference  between  the  empirical  cumulative  distri¬ 
bution  function  introduced  in  Chap.  2  and  the  CDF  introduced  in  this 
chapter.  In  Chap.  2,  we  work  with  the  cumulative  distribution  of  observed 
values  in  a  particular  sample,  whereas  in  this  chapter,  we  deal  with  ran¬ 
dom  variables  modelling  the  distribution  of  a  general  population. 

/  The  expectation  and  the  variance  of  a  random  variable  are  defined  as 
follows: 


Expectation 

Variance 

Discrete 

'sr^n 

Z,  =  l  xi  Pi 

E?=l(*«-  -B(X))2Pi 

Continuous 

E™xf(x)dx 

f^(x  -E(X))2f(x)dx 

/  Some  important  calculation  rules  are: 

E{a  +  bX)  =  a  +  bE(X)\  Var(a  +  bX)  =  b2  Var(X); 

E(X  +  Y)  =  E(X)  +  E(F);  Var(X  ±Y)  =  Var(X)  +  Var(F) 

±2Cov(X,  Y) 

/  Bivariate  random  variables  ( X ,  F)  have  a  joint  CDF  Fxf(v,  y )  which 
specifies  the  probability  P  (X  <x\Y<  y).  The  conditional  distribution 
of  X\Y  [Y\X]  is  the  PDF  of  X  [Y]  for  a  given  value  Y  =  y  [X  =  x]. 
The  marginal  distribution  of  X  [Y]  is  the  distribution  of  X  [Y]  without 
referring  to  the  values  of  F  [X]. 


7.9  Exercises 


Exercise  7. 1  Consider  the  following  cumulative  distribution  function  of  a  random 
variable  X : 


F(x)  = 


0  if  v  <  2 

—  \x2  -\-2x  —  3  if  2<v<4 
1  if  v  >  4. 


(a)  What  is  the  PDF  of  X  ? 

(b)  Calculate  P(X  <  3)  and  P(X  =  4). 

(c)  Determine  E(X)  and  Var(X). 
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Exercise  7.2  Joey  manipulates  a  die  to  increase  his  chances  of  winning  a  board  game 
against  his  friends.  In  each  round,  a  die  is  rolled  and  larger  numbers  are  generally  an 
advantage.  Consider  the  random  variable  X  denoting  the  outcome  of  the  rolled  die 
and  the  respective  probabilities  P(X  =  1  =  2  =  3  =  5)  =  1/9,  P(X  =  4)  =  2/9, 
and  P(X  =  6)  =  3/9. 

(a)  Calculate  and  interpret  the  expectation  and  variance  of  X. 

(b)  Imagine  that  the  board  game  contains  an  action  which  makes  the  players  use 
l/X  rather  than  X.  What  is  the  expectation  of  Y  =  l/X2ls  E(Y)  =  E(l/X)  = 
1/E(X)7 

Exercise  7.3  An  innovative  winemaker  experiments  with  new  grapes  and  adds  a 
new  wine  to  his  stock.  The  percentage  sold  by  the  end  of  the  season  depends  on  the 
weather  and  various  other  factors.  It  can  be  modelled  using  the  random  variable  X 
with  the  CDF  as 

0  if  v  <  0 

F(x)  —  <  3x2  —  2x3  if  0  <  v  <  1 

1  if  v  >  1 . 

(a)  Plot  the  cumulative  distribution  function  with  R. 

(b)  Determine  fix). 

(c)  What  is  the  probability  of  selling  at  least  one-third  of  his  wine,  but  not  more 
than  two  thirds? 

(d)  Define  the  CDF  in  R  and  calculate  the  probability  of  c)  again. 

(e)  What  is  the  variance  of  X  ? 


Exercise  7.4  A  quality  index  summarizes  different  features  of  a  product  by  means 
of  a  score.  Different  experts  may  assign  different  quality  scores  depending  on  their 
experience  with  the  product.  Let  X  be  the  quality  index  for  a  tablet.  Suppose  the 
respective  probability  density  function  is  given  as  follows: 


fix)  = 


cx( 2  —  x)  if  0  <  v  <  2 
0  elsewhere. 


(a)  Determine  c  such  that  fix)  is  a  proper  PDF. 

(b)  Determine  the  cumulative  distribution  function. 

(c)  Calculate  the  expectation  and  variance  of  X. 

(d)  Use  Tschebyschev’s  inequality  to  determine  the  probability  that  X  does  not 
deviate  more  than  0.5  from  its  expectation. 
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Exercise  7.5  Consider  the  joint  PDF  for  the  type  of  customer  service  X  (0  = 
telephonic  hotline,  1  =  Email)  and  of  satisfaction  score  7(1=  unsatisfied,  2  =  sat¬ 
isfied,  3  =  very  satisfied): 


X\Y 

1 

2 

3 

0 

0 

1/2 

1/4 

1 

1/6 

1/12 

.  0 

(a)  Determine  and  interpret  the  marginal  distributions  of  both  X  and  7. 

(b)  Calculate  the  75  %  quantile  for  the  marginal  distribution  of  Y. 

(c)  Determine  and  interpret  the  conditional  distribution  of  satisfaction  level  for  X  = 

1. 

(d)  Are  the  two  variables  independent? 

(e)  Calculate  and  interpret  the  covariance  of  X  and  Y. 

Exercise  7.6  Consider  a  continuous  random  variable  X  with  expectation  15  and 
variance  4.  Determine  the  smallest  interval  [15  —  c,  15  + c]  which  contains  at  least 
90  %  of  the  values  of  X. 

Exercise  7. 7  Let  X  and  Y  be  two  random  variables  for  which  only  6  possible  events — 
Ai,  A2,  A3,  A4,  A5,  A6 — are  defined: 


i 

1  2  3  4  5  6 

P(Ai ) 

0.3  0.1  0.1  0.2  0.2  0.1 

Xi 

-12  2  -1-12 

Yi 

0  2  0  1  2  1 

(a)  What  is  the  joint  PDF  of  X  and  7? 

(b)  Calculate  the  marginal  distributions  of  X  and  7. 

(c)  Are  both  variables  independent? 

(d)  Determine  the  joint  PDF  for  U  =  X  +  7. 

(e)  Calculate  E(£7)  and  Var (£/)  and  compare  it  with  E(X)  +  E(7)  and  Var(X)  + 
Var(7),  respectively. 


Exercise  7.8  Recall  the  urn  model  we  introduced  in  Chap.  5.  Consider  an  urn  with 
eight  balls:  four  of  them  are  white,  three  are  black,  and  one  is  red.  Now,  two  balls 
are  drawn  from  the  urn.  The  random  variables  X  and  7  are  defined  as  follows: 

1  black  ball 

X  =  2  red  ball  in  the  first  draw 

3  white  ball 


7  = 


1  black  ball 

2  red  ball  in  the  second  draw 

3  white  ball. 
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(a)  When  are  X  and  Y  independent — when  the  two  balls  are  drawn  with  replacement 
or  without  replacement? 

(b)  Assume  the  balls  are  drawn  such  that  X  and  Y  are  dependent.  Use  the  conditional 
distribution  P(Y  \X)  to  determine  the  joint  PDF  of  X  and  Y. 

(c)  Calculate  E(X),  E(y),  and  p(X,  Y). 


Exercise  7.9  If  X  is  the  amount  of  money  spent  on  food  and  other  expenses  during 
a  day  (in  €)  and  Y  is  the  daily  allowance  of  a  businesswoman,  the  joint  density  of 
these  two  variables  is  given  by 


fxY (x,  y) 


c  (i^L)  if  10  <  JC  <  100,  40  <  y  <  100 
0  elsewhere. 


(a)  Choose  c  such  that  /xfCu  y)  is  a  probability  density  function. 

(b)  Find  the  marginal  distribution  of  X. 

(c)  Calculate  the  probability  that  more  than  €75  are  spent. 

(d)  Determine  the  conditional  distribution  of  Y  given  X. 


Exercise  7.10  Considers  i.i.d.  random  variables  X/  withE(X/)  =  /iandVar  (X/)  = 
a2  and  the  standardized  variable  Y  =  Show  that  E(E)  =  0  and  Var(E)  =  1. 


Solutions  to  all  exercises  in  this  chapter  can  be  found  on  p.  365 


"Source  Toutenburg,  H.,  Heumann,  C.,  Induktive  Statistik ,  4th  edition,  2007, 
Springer,  Heidelberg 


Probability  Distributions 


We  introduced  the  concept  of  probability  density  and  probability  mass  functions  of 
random  variables  in  the  previous  chapter.  In  this  chapter,  we  are  introducing  some 
common  standard  discrete  and  continuous  probability  distributions  which  are  widely 
used  for  either  practical  applications  or  constructing  statistical  methods  described 
later  in  this  book.  Suppose  we  are  interested  in  determining  the  probability  of  a  cer¬ 
tain  event.  The  determination  of  probabilities  depends  upon  the  nature  of  the  study 
and  various  prevailing  conditions  which  affect  it.  For  example,  the  determination  of 
the  probability  of  a  head  when  tossing  a  coin  is  different  from  the  determination  of 
the  probability  of  rain  in  the  afternoon.  One  can  speculate  that  some  mathematical 
functions  can  be  defined  which  depict  the  behaviour  of  probabilities  under  different 
situations.  Such  functions  have  special  properties  and  describe  how  probabilities 
are  distributed  under  different  conditions.  We  have  already  learned  that  they  are 
called  probability  distribution  functions.  The  form  of  such  functions  may  be  sim¬ 
ple  or  complicated  depending  upon  the  nature  and  complexity  of  the  phenomenon 
under  consideration.  Let  us  first  recall  and  extend  the  definition  of  independent  and 
identically  distributed  random  variables: 

Definition  8.0.1  The  random  variables  X\,  X2, . . . ,  Xn  are  called  independent  and 
identically  distributed  (i.i.d)  if  the  Xi  (i  =  1,2 ,  ,n)  have  the  same  marginal  cumu¬ 
lative  distribution  function  F(x)  and  if  they  are  mutually  independent. 

Example  8. 0. 1  Suppose  a  researcher  plans  a  survey  on  the  weight  of  newborn  babies 
in  a  country.  The  researcher  randomly  contacts  10  hospitals  with  a  maternity  ward 
and  asks  them  to  randomly  select  20  of  the  newborn  babies  (no  twins)  born  in  the  last 
6  months  and  records  their  weights.  The  sample  therefore  consists  of  10  x  20  =  200 
baby  weights.  Since  the  hospitals  and  the  babies  are  randomly  selected,  the  babies’ 
weights  are  therefore  not  known  beforehand.  The  200  weights  can  be  denoted  by  the 
random  variables  X\,  X2, . . . ,  26200-  Note  that  the  weights  Xi  are  random  variables 
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because,  depending  on  the  size  of  the  population,  different  samples  consisting  of 
200  babies  can  be  randomly  selected.  Also,  the  babies’  weights  can  be  seen  as 
stochastically  independent  (an  example  of  stochastically  dependent  weights  would 
be  the  weights  of  twins  if  they  are  included  in  the  sample).  After  collecting  the 
weights  of  200  babies,  the  researcher  has  a  sample  of  200  realized  values  (i.e.  the 
weights  in  grams).  The  values  are  now  known  and  denoted  by  x\,  X2,  . . . ,  V200- 


8.1  Standard  Discrete  Distributions 

First,  we  discuss  some  standard  distributions  for  discrete  random  variables. 


8.1 .1  Discrete  Uniform  Distribution 


The  discrete  uniform  distribution  assumes  that  all  possible  outcomes  have  equal 
probability  of  occurrence.  A  more  formal  definition  is  given  as  follows: 


Definition  8.1.1  A  discrete  random  variable  X  with  k  possible  outcomes  x\ ,  X2,  . . . , 
Xk  is  said  to  follow  a  discrete  uniform  distribution  if  the  probability  mass  function 
(PMF)  of  X  is  given  by 

1 

P(X  =  Xi)  =  V/  =  1,2,...,  k.  (8.1) 

k 


If  the  outcomes  are  the  natural  numbers  jc/  =  i  (i  =  1,2  ,...,&),  the  mean  and  vari¬ 
ance  of  X  are  obtained  as 


E(X)  = 


k  +  1 


Var(X)  = 


(8.2) 

(8.3) 


Example  8.1.1  If  we  roll  a  fair  die,  the  outcomes  “1”,  “2”, . . .,  “6”  have  equal  prob¬ 
ability  of  occurring,  and  hence,  the  random  variable  X  “number  of  dots  observed  on 
the  upper  surface  of  the  die”  has  a  uniform  discrete  distribution  with  PMF 


1 

P(X  =  i)  =  ~,  for  all  /  =  1,2, . . . ,  6. 

6 

The  mean  and  variance  of  X  are 


E(X)  = 


Var(X)  = 


6+1 


=  3.5, 

-  1)  =  35/12. 
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Fig.  8.1  Frequency 
distribution  of  1000 
generated  discrete  uniform 
random  numbers  with 
possible  outcomes 
(2,5,8,  10) 


o 


X 


Using  the  function  sample  ()  in  R,  it  is  easy  to  generate  random  numbers  from  a 
discrete  uniform  distribution.  The  following  command  generates  a  random  sample 
of  size  1000  from  a  uniform  distribution  with  the  four  possible  outcomes  2,  5,  8,  10 
and  draws  a  bar  chart  of  the  observed  numbers.  The  use  of  the  set.seedO  function 
allows  to  reproduce  the  generated  random  numbers  at  any  time.  It  is  necessary  to 
use  the  option  replace=TRUE  to  simulate  draws  with  replacement,  i.e.  to  guarantee 
that  a  value  can  occur  more  than  once. 

set . seed(123789) 

x  <-  sample (x=c (2 , 5 , 8 , 10) ,  size=1000,  replace=T, 
prob=c (1/4, 1/4, 1/4, 1/4)) 
barchart (table (x) ,  ylim=c (0 , 300) ) 

A  bar  chart  of  the  frequency  distribution  of  the  1000  sampled  numbers  with  the 
possible  outcomes  (2,  5,  8,  10)  using  the  discrete  uniform  distribution  is  given  in 
Fig.  8.1.  We  see  that  the  1000  generated  random  numbers  are  not  exactly  uniformly 
distributed,  e.g.  the  numbers  5  and  10  occur  more  often  than  the  numbers  2  and  8. 
In  fact,  they  are  only  approximately  uniform.  We  expect  that  the  deviance  from 
a  perfect  uniform  distribution  is  getting  smaller  as  we  generate  more  and  more 
random  numbers  but  will  probably  never  be  zero  for  a  finite  number  of  draws.  The 
random  numbers  reflect  the  practical  situation  that  a  sample  distribution  is  only  an 
approximation  to  the  theoretical  distribution  from  which  the  sample  was  drawn.  More 
details  on  how  to  work  with  random  variables  in  R  are  given  in  Appendix  A. 3. 
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8.1 .2  Degenerate  Distribution 


Definition  8.1.2  A  random  variable  X  has  a  degenerate  distribution  at  a,  if  a  is 

the  only  possible  outcome  with  P(X  =  a)  =  1.  The  CDF  in  such  a  case  is  given  by 


F(x)  = 


0  if  x  <  a 
1  if  x  >  a. 


Further,  E(X)  =  a  and  Var(X)  =  0. 

The  degenerate  distribution  indicates  that  there  is  only  one  possible  fixed  outcome, 
and  therefore,  no  randomness  is  involved.  It  follows  that  we  need  at  least  two  different 
possible  outcomes  to  have  randomness  in  the  observations  of  a  random  variable  or 
random  experiment.  The  Bernoulli  distribution  is  such  a  distribution  where  there  are 
only  two  outcomes,  e.g.  success  and  failure  or  male  and  female.  These  outcomes  are 
usually  denoted  by  the  values  “0”  and  “1”. 


8.1 .3  Bernoulli  Distribution 


Definition  8.1.3  A  random  variable  X  has  a  Bernoulli  distribution  if  the  PMF  of  X 
is  given  as 


P(X  =  x)  = 

The  cumulative  distribution  function  (CDF)  of  X  is 


p  if  v  =  1 

1  —  p  if  jv  =  0. 


F(x)  = 


0  if  v  <  0 

1  —  p  if  0  <  v  <  1 
1  if  v  >  1 . 


The  mean  (expectation)  and  variance  of  a  Bernoulli  random  variable  are  calculated 
as 


E(X)  =  1  •  p  +  0  •  (1  —  P)  =  P 


(8-4) 


and 

Var(X)  =  (1  -  p)2  ■  p  +  (0  -  p )2  •  (1  -  p)  =  p(  1  -  p),  (8.5) 


respectively. 


A  Bernoulli  distribution  is  useful  when  there  are  only  two  possible  outcomes,  and 
our  interest  lies  in  any  of  the  two  outcomes,  e.g.  whether  a  customer  buys  a  certain 
product  or  not,  or  whether  a  hurricane  hits  an  island  or  not.  The  outcome  of  an  event 
A  is  usually  coded  as  1  which  occurs  with  probability  p.  If  the  event  of  interest  does 
not  occur,  i.e.  the  complementary  event  A  occurs,  the  outcome  is  coded  as  0  which 
occurs  with  probability  1  —  p.  So  p  is  the  probability  that  the  event  of  interest  A 


occurs. 
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Example  8.1.2  A  company  organizes  a  raffle  at  an  end-of-year  function.  There  are 
300  lottery  tickets  in  total,  and  50  of  them  are  marked  as  winning  tickets.  The  event 
A  of  interest  is  “ticket  wins”  (coded  as  X  =  1),  and  the  probability  p  of  having  a 
winning  ticket  is  a  priori  (i.e.  before  any  lottery  ticket  has  been  drawn) 


P(X  =  1)  = 


50  1  250  5 

- =  -  =  p  and  P(X  =  0)  =  - =  -  =  1  —  p. 

300  6  ^  300  6 


According  to  (8.4)  and  (8.5),  the  mean  (expectation)  and  variance  of  X  are 

1  15  5 

E(X)  =  -  and  Var(X)  =  -  •  -  =  — respectively. 

6  6  6  36 


8.1 .4  Binomial  Distribution 

Consider  n  independent  trials  or  repetitions  of  a  Bernoulli  experiment.  In  each  trial 
or  repetition,  we  may  observe  either  A  or  A.  At  the  end  of  the  experiment,  we  have 
thus  observed  A  between  0  and  n  times.  Suppose  we  are  interested  in  the  probability 
of  A  occurring  k  times,  then  the  binomial  distribution  is  useful. 

Example  8.1.3  Consider  a  coin  tossing  experiment  where  a  coin  is  tossed  ten  times 
and  the  event  of  interest  is  A  =  “head”.  The  random  variable  X  “number  of  heads  in 
10  experiments”  has  the  possible  outcomes  k  =  0,  1,  . . . ,  10.  A  question  of  interest 
may  be:  What  is  the  probability  that  a  head  occurs  in  7  out  of  10  trials;  or  in  5  out 
of  10  trials?  We  assume  that  the  order  in  which  heads  (and  tails)  appear  is  not  of 
interest,  only  the  total  number  of  heads  is  of  interest. 

Questions  of  this  kind  are  answered  by  the  binomial  distribution.  This  distribution 
can  either  be  motivated  as  a  repetition  of  n  Bernoulli  experiments  (as  in  the  above 
coin  tossing  example)  or  by  the  urn  model  (see  Chap.  5):  assume  there  are  M  white 
and  N  —  M  black  balls  in  the  urn.  Suppose  n  balls  are  drawn  randomly  from  the  urn, 
the  colour  of  the  ball  is  recorded  and  the  ball  is  placed  back  into  the  urn  (sampling  with 
replacement).  Let  A  be  the  event  of  interest  that  a  white  ball  is  drawn  from  the  urn. 
The  probability  of  A  is  p  =  M/N  (the  probability  of  drawing  a  black  ball  is  1  —  p  = 
(N  —  M) /N).  Since  the  balls  are  drawn  with  replacement,  these  probabilities  do  not 
change  from  draw  to  draw.  Further,  let  X  be  the  random  variable  counting  the  number 
of  white  balls  drawn  from  the  urn  in  the  n  experiments.  Since  the  order  of  the  resulting 
colours  of  balls  is  not  of  interest  in  this  case,  there  are  Q)  combinations  where  k 
balls  are  white  and  n  —  k  balls  are  black.  Since  the  balls  are  drawn  with  replacement, 
every  outcome  of  the  n  experiments  is  independent  of  all  others.  The  probability  that 
X  =  k,  k  =  0,  1,  ...  ,n,  can  therefore  be  calculated  as 

P{X=k)  =  r\pk{\-  p)n~k  (fc  =  0,l,...,n).  (8.6) 

Please  note  that  we  can  use  the  product  pk(  1  —  p)n~k  because  the  draws  are  inde¬ 
pendent.  The  binomial  coefficient  is  necessary  to  count  the  number  of  possible 
orders  of  the  black  and  white  balls. 
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Definition  8.1.4  A  discrete  random  variable  X  is  said  to  follow  a  binomial  distribu¬ 
tion  with  parameters  n  and  p  if  its  PMF  is  given  by  (8.6).  We  also  write  X  ~  B(n;  p ). 
The  mean  and  variance  of  a  binomial  random  variable  X  are  given  by 

E(X)  =  np , 

Var(X)  =  np{  1  —  p). 

Remark  8.1.1  A  Bernoulli  random  variable  is  therefore  B(  1;  p)  distributed. 

Example  8.1.4  Consider  an  unfair  coin  where  the  probability  of  observing  a  tail  (T) 
is  p(T)  =  0.6.  Let  us  denote  tails  by  “1”  and  heads  by  “0”.  Suppose  the  coin  is 
tossed  three  times.  In  total,  there  are  the  23  =  8  following  possible  outcomes: 


(8.7) 

(8.8) 


Outcome 

X  =  * 

1 

1 

1 

3 

1 

1 

0 

2 

1 

0 

1 

2 

0 

1 

1 

2 

1 

0 

0 

1 

0 

1 

0 

1 

0 

0 

1 

1 

0 

0 

0 

0 

Note  that  the  first  outcome,  viz.  (1,  1,  1)  leads  to  v  =  3,  the  next  3  outcomes,  viz., 
(1,  1,0),  (1,0,  1),  (0,  1,  1)  obtained  by  (=  Q)  lead  to  v  =  2,  the  next  3  outcomes, 

viz.,  (1,  0,  0),  ((0,  1,0),  (0,  0,  1)  obtained  by  (=  (3))  lead  to  v  =  1,  and  the  last 

outcome,  viz.  (0,  0,  0)  obtained  by  (=  Q)  leads  to  x  =  0.  We  can,  for  example, 
calculate 

0.62(1  —  0.6)1  =  0.432  (or  43.2%). 

Further,  the  mean  and  variance  of  X  are 

E(X)  =  np  =  3  •  0.6  =  1.8,  and  Var(X)  =  np(l  —  p)  =  3  •  0.6  •  0.4  =  0.72. 

Functions  for  the  binomial  distribution,  as  well  as  many  other  distributions,  are 
implemented  in  R.  For  each  of  these  distributions,  we  can  easily  determine  the  density 
function  (PMF,  PDF)  for  given  values  and  parameters,  determine  the  CDF,  calculate 
quantiles  and  draw  random  numbers.  Appendix  A. 3  gives  more  details.  Nevertheless, 
we  illustrate  the  concept  of  dealing  with  distributions  in  R  in  the  following  example. 

Example  8.1.5  Suppose  we  roll  an  unfair  die  50  times  with  the  probability  of  a  tail 
pt3i a  =  0.6.  We  thus  deal  with  a  5(50,  0.6)  distribution  which  can  be  plotted  using 
the  dbinom  command.  The  prefix  d  stands  for  “density”. 


P(X  =  2)  = 


0 
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Fig.  8.2  PMFofa 
5(50,  0.6)  distribution 


0  10  20  30  40  50 


X 


n  <-  50 

p  <-  0.6 

k  <-  0 :n 

pmf  <-  dbinom(k,n,p) 
plot (k, pmf,  type=h) 

A  plot  ofthe  PMF  of  a  binomial  distribution  withft  =  50and  p  =  0.6(i.e.  5(50,  0.6)) 
is  given  in  Fig.  8.2. 

Note  that  we  can  also  calculate  the  CDF  with  R.  We  can  use  the  pbinom  (x ,  n ,  p) 
command,  where  the  prefix  p  stands  for  probability,  to  calculate  the  CDF  at  any 
point.  For  example,  suppose  we  are  interested  in  P(X  >  30)  =  1  —  F  (29),  that  is 
the  probability  of  observing  thirty  or  more  tails;  then  we  write 


l-pbinom(29 ,50,0.6) 

1  l  V 

[1]  0.5610349 

Similarly,  we  can  determine  quantiles.  For  instance,  the  80  %  quantile  q  which 
describes  that  P(X  <  q)  >  0.8  can  be  obtained  by  the  qbinom(q,n,p)  command 
as  follows: 


qbinom (0.8,50,0.6) 

Ipjy* 

II  m 

[1]  33 

If  we  want  to  generate  100  random  realizations  from  a  5(50,  0.6)  distribution  we 
can  use  the  rbinom  command. 

rbinomdOO ,  50 , 0 . 6) 

The  binomial  distribution  has  some  nice  properties.  One  of  them  is  described  in 
the  following  theorem: 
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Theorem  8.1.1  Let  X  ~  B(n\  p)  and  Y  ~  B(m\  p)  and  assume  that  X  and  Y  are 
(stochastically)  independent.  Then 

X  +  Y  ~  B(n  +  m;  p).  (8.9) 

This  is  intuitively  clear  since  we  can  interpret  this  theorem  as  describing  the  addi¬ 
tive  combination  of  two  independent  binomial  experiments  with  n  and  m  trials,  with 
equal  probability  p ,  respectively.  Since  every  binomial  experiment  is  a  series  of  inde¬ 
pendent  Bernoulli  experiments,  this  is  equivalent  to  a  series  of  n  +  m  independent 
Bernoulli  trials  with  constant  success  probability  p  which  in  turn  is  equivalent  to  a 
binomial  distribution  with  n  +  m  trials. 


8.1 .5  Poisson  Distribution 

Consider  a  situation  in  which  the  number  of  events  is  very  large  and  the  probability 
of  success  is  very  small:  for  example,  the  number  of  alpha  particles  emitted  by  a 
radioactive  substance  entering  a  particular  region  in  a  given  short  time  interval.  Note 
that  the  number  of  emitted  alpha  particles  is  very  high  but  only  a  few  particles  are 
transmitted  through  the  region  in  a  given  short  time  interval.  Some  other  examples 
where  Poisson  distributions  are  useful  are  the  number  of  flu  cases  in  a  country  within 
one  year,  the  number  of  tropical  storms  within  a  given  area  in  one  year,  or  the  number 
of  bacteria  found  in  a  biological  investigation. 

Definition  8.1.5  A  discrete  random  variable  X  is  said  to  follow  a  Poisson  distribution 
with  parameter  A  >  0  if  its  PMF  is  given  by 

A* 

P(X  =  x)  =  —  exp  (—A)  (x  =0,1,2,.. .).  (8.10) 

x\ 

We  also  write  X  ~  Po( A).  The  mean  and  variance  of  a  Poisson  random  variable  are 
identical: 

E(X)  =  Var(X)  =  A. 


Example  8.1.6  Suppose  a  country  experiences  X  =  4  tropical  storms  on  average  per 
year.  Then  the  probability  of  suffering  from  only  two  tropical  storms  is  obtained  by 
using  the  Poisson  distribution  as 


P(X  =  2)  = 


A*  42 

—  exp(— A)  =  —  exp(— 4)  =  0.146525. 
x\  21 


If  we  are  interested  in  the  probability  that  not  more  than  2  storms  are  experienced,  then 
we  can  apply  rules  (7.7)-(7.13)  from  Chap.  7:  P(X  <  2)  =  P(X  =  2)  +  P(X  = 
1)  +  p(x  =  0)  =  F(2)  =  0.2381033.  We  can  calculate  P(X  =  1)  and  P(X  =  0) 
from  (8.10)  or  using  R.  Similar  to  Example  8.1.5,  we  use  the  prefix  d  to  obtain 
the  PMF  and  the  prefix  p  to  work  with  the  CDF,  i.e.  we  can  use  dpois(x,A)  and 
ppois(x,A)  to  determine  P(X  =  x)  and  P(X  <  x),  respectively. 
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dpois(2,4)  +  dpois(l,4)  +  dpois(0,4) 
[1]  0.2381033 
ppois (2,4) 

[1]  0.2381033 


8.1 .6  Multinomial  Distribution 


We  now  consider  random  experiments  where  k  distinct  or  disjoint  events  A  \ ,  A  2 ,  . . . , 
Ak  can  occur  with  probabilities  pi,  P2,  •  •  • ,  Pk>  respectively,  with  the  restriction 
Xj=i  Pj  =  1-  For  example,  if  eight  parties  compete  in  a  political  election,  we  may 
be  interested  in  the  probability  that  a  person  votes  for  party  Ay,  j  =  1,  2, . . . ,  8. 
Similarly  one  might  be  interested  in  the  probability  whether  tuberculosis  is  detected 
in  the  lungs  (Ai),  in  other  organs  (A2),  or  both  (A3).  Practically,  we  often  use  the 
multinomial  distribution  to  model  the  distribution  of  categorical  variables.  This  can 
be  interpreted  as  a  generalization  of  the  binomial  distribution  (where  only  two  distinct 
events  can  occur)  to  the  situation  where  more  than  two  events  or  outcomes  can 
occur.  If  the  experiment  is  repeated  n  times  independently,  we  are  interested  in  the 
probability  that 


A 1  occurs  -times,  A  2  occurs  ^2-times,  ...,  Ak  occurs  ^^-times 

with  nj  —  n-  Since  several  events  can  occur,  the  outcome  of  one  (of  the  n) 

experiments  is  conveniently  described  by  binary  indicator  variables.  Let  V/y ,  i  = 
1 ,  ,n,  j  =  1  ,...,&,  denote  the  event  “Ay  is  observed  in  experiment  i.e. 


1  if  Ay  occurs  in  experiment  i 
0  if  Ay  does  not  occur  in  experiment  i 


with  probabilities  P(Vij  =  1)  =  Pj,  j  =  1,2 then,  the  outcome  of  one 
experiment  is  a  vector  of  length  k , 


with  “1”  being  present  in  only  one  position,  i.e.  in  position  j,  if  Ay  occurs  in  exper¬ 
iment  i.  Now,  define  (for  each  j  =  1,  ...,&)  Xj  =  Vyy .  Then,  Xj  is  counting 
how  often  event  Ay  was  observed  in  the  n  independent  experiments  (i.e.  how  often 
Vij  was  1  in  the  n  experiments). 


Definition  8.1.6  The  random  vector  X  =  (X\,  X2, . . . ,  Xk)  is  said  to  follow  a 

multinomial  distribution  if  its  PMF  is  given  as 


P(X  1  =  n  1,  X2  =  ri2,  . . . ,  Xk  =  rik)  = 


n\ 


n\\ri2\  •  •  •  kik • 


n\  no 

P\  Pi 


ni 

Pt 


(8.11) 


%  1  ni  =  n  and  Zj 

pic).  The  mean  of  X  is  the  (component- wise)  vector 


with  the  restrictions  Zj=i  ni  =  n  and  Z;=i  Pj  =  1-  We  also  write  X  ~  M (n:  p\ , 


E(X)  =  (E(Xi),E(X2),...,E(Xt)) 
=  (jipi,np2,  ...,npk). 
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The  (/,  j  )th  element  of  the  covariance  matrix  V  (X)  is 

npi(  1  -  Pi)  if  i  =  j 


Co  \(XitXi)  = 


-npipj  if  i  ^  j. 


Remark  8.1.2  Due  to  the  restriction  that  Xy=i  nj  =  V  =  n,  X\,  . . . ,  are 

not  stochastically  independent  which  is  reflected  by  the  negative  covariance.  This 
is  also  intuitively  clear:  if  one  Xj  gets  higher,  another  Xy,  j  /  /,  has  to  become 
lower  to  satisfy  the  restrictions. 


We  use  the  multinomial  distribution  to  describe  the  randomness  of  categorical 
variables.  Suppose  we  are  interested  in  the  variable  “political  party”;  there  might 
be  eight  political  parties,  and  we  could  thus  summarize  this  variable  by  eight  binary 
variables,  each  of  them  describing  the  event  of  party  A  j,  j  =  1 ,  2,  . . . ,  8,  being  voted 
for.  In  this  sense,  X  =  (X\,  X2,  . . . ,  Xs)  follows  a  multinomial  distribution. 


Example  8.1.7  Consider  a  simple  example  of  the  urn  model.  The  urn  contains  50 
balls  of  three  colours:  25  red  balls,  15  white  balls,  and  10  black  balls.  The  balls  are 
drawn  from  the  urn  with  replacement.  The  balls  are  placed  back  into  the  urn  after 
every  draw,  which  means  the  draws  are  independent.  Therefore,  the  probability  of 
drawing  a  red  ball  in  every  draw  is  p\  =  ^  =0.5.  Analogously,  p2  =  0.3  (for  white 
balls)  and  p?>  =0.2  (for  black  balls).  Consider  n  =  4  draws.  The  probability  of  the 
random  event  of  drawing  “2  red  balls,  1  white  ball,  and  1  black  ball”  is: 

P(X  1  =  2,  X2  =  1,  x3  =  1)  =  d2(0.5)2(0.3)‘(0.2)1  =  0.18.  (8.12) 

We  would  have  obtained  the  same  result  in  R  using  the  dmultinom  function: 

dmultinom(c (2 , 1, 1) ,prob=c(0.5,0.3,0.2)) 


This  example  demonstrates  that  the  multinomial  distribution  relates  to  an  exper¬ 
iment  with  replacement  and  without  considering  the  order  of  the  draws.  Instead  of 
the  urn  model,  consider  another  example  where  we  may  interpret  these  three  proba¬ 
bilities  as  probabilities  of  voting  for  candidate  Aj,  j  =  1,  2,  3,  in  an  election.  Now, 
suppose  we  ask  four  voters  about  their  choice,  then  the  probability  of  candidate  A\ 
receiving  2  votes,  candidate  A 2  receiving  1  vote,  and  candidate  A3  receiving  1  vote 
is  18  %  as  calculated  in  (8.12). 

Remark  8.1.3  In  contrast  to  most  of  the  distributions,  the  CDF  of  the  multinomial 
distribution,  i.e.  the  function  calculating  P(X\  <  x\,  X2  <  X2,  . . . ,  X^  <  Xk),  is  not 
contained  in  the  base  ^-distribution.  Please  note  that  for  k  =  2,  the  multinomial 
distribution  reduces  to  the  binomial  distribution. 
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8.1 .7  Geometric  Distribution 

Consider  a  situation  in  which  we  are  interested  in  determining  how  many  independent 
Bernoulli  trials  are  needed  until  the  event  of  interest  occurs  for  the  first  time.  For 
instance,  we  may  be  interested  in  how  many  tickets  to  buy  in  a  raffle  until  we  win 
for  the  first  time,  or  how  many  different  drugs  to  try  to  successfully  tackle  a  severe 
migraine,  etc.  The  geometric  distribution  can  be  used  to  determine  the  probability 
that  the  event  of  interest  happens  at  the  kth  trial  for  the  first  time. 

Definition  8.1.7  A  discrete  random  variable  X  is  said  to  follow  a  geometric  distri¬ 
bution  with  parameter  p  if  its  PMF  is  given  by 

P(X  =  k)  =  p(\  -  p)k~\  k=  1,2,3,...  (8.13) 

The  mean  (expectation)  and  variance  are  given  by  E(X)  =  1/p  and  Var(X)  = 
1  / p ( 1  / p  —  1),  respectively. 

Example  8.1.8  Let  us  consider  an  experiment  where  a  coin  is  tossed  until  “head”  is 
obtained  for  the  first  time.  The  probability  of  getting  a  head  is  p  =  0.5  for  each  toss. 
Using  (8.13),  we  can  determine  the  following  probabilities: 

P(X  =  1)  =  0.5 

P(X  =  2)  =  0.5(1  -0.5)  =  0.25 
P(X  =  3)  =  0.5(1  -  0.5)2  =  0.125 
P(X  =  4)  =  0.5(1  -  0.5)3  =  0.0625 


Using  the  command  structure  for  obtaining  PMF’s  in  R  (Appendix  A  as  well  as 
Examples  8.1.5  and  8.1.6),  we  can  determine  the  latter  probability  of  P(X  =  4)  as 
follows: 

dgeom(3,0 .5) 


Note  that  the  definition  of  X  in  R  slightly  differs  from  our  definition.  In  R,  k  is 
the  number  of  failures  before  the  first  success.  This  means  we  need  to  specify  k  —  1 
in  the  dgeom  function  rather  than  k.  The  mean  and  variance  for  this  setting  are 


E(X)  = 


1 

05 


Var(X)  =  —  ( —  -  1  | 
0.5  \0.5  ) 


8.1 .8  Hypergeometric  Distribution 

We  can  again  use  the  urn  model  to  motivate  another  distribution,  the  hypergeometric 
distribution.  Consider  an  urn  with  We  randomly  draw  n  balls  without  replacement, 
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M 

N  -  M 

white  balls 

black  balls 

N 

total  balls 

i.e.  we  do  not  place  a  ball  back  into  the  urn  once  it  is  drawn.  The  order  in  which  the 
balls  are  drawn  is  assumed  to  be  of  no  interest;  only  the  number  of  drawn  white  balls 
is  of  relevance.  We  define  the  following  random  variable 


X  :  “number  of  white  balls  (x)  among  the  n  drawn  balls”. 


To  be  more  precise,  among  the  n  drawn  balls,  v  are  white  and  n  —  x  are  black.  There 
are  (^)  possibilities  to  choose  v  white  balls  from  the  total  of  M  white  balls,  and 

analogously,  there  are  iNn^)  possibilities  to  choose  (n  —  x)  black  balls  from  the 
total  of  N  —  M  black  balls.  In  total,  we  draw  n  out  of  N  balls.  Recall  the  probability 
definition  of  Laplace  as  the  number  of  simple  favourable  events  divided  by  all  possible 
events.  The  number  of  combinations  for  all  possible  events  is  (^);  the  number  of 

favourable  events  is  (^)  (NnZ^)  because  we  draw,  independent  of  each  other,  v  out 
of  M  balls  and  n  —  x  out  of  N  —  M  balls.  Hence,  the  PMF  of  the  hypergeometric 
distribution  is 


P(X  =  x) 


(8.14) 


for  v  G  {max(0,  n  —  (N  —  M)),  . . . ,  min(n,  M)}. 


Definition  8.1.8  A  random  variable  X  is  said  to  follow  a  hypergeometric  distribu¬ 
tion  with  parameters  n,  M,  N,  i.e.  X  ~  H(n,  M,  N),  if  its  PMF  is  given  by  (8.14). 


Example  8.1.9  The  German  national  lottery  draws  6  out  of  49  balls  from  a  rotating 
bowl.  Each  ball  is  associated  with  a  number  between  1  and  49.  A  simple  bet  is  to 
choose  6  numbers  between  1  and  49.  If  3  or  more  chosen  numbers  correspond  to 
the  numbers  drawn  in  the  lottery,  then  one  wins  a  certain  amount  of  money.  What 
is  the  probability  of  choosing  4  correct  numbers?  We  can  utilize  the  hypergeometric 
distribution  with  v  =4 ,  M  =  6,  N  =  49,  and  n  =  6  to  calculate  such  probabilities. 
The  interpretation  is  that  we  “draw”  (i.e.  bet  on)  4  out  of  the  6  winning  balls  and 
“draw”  (i.e.  bet  on)  another  2  out  of  the  remaining  43  (49  —  6)  losing  balls.  In  total, 
we  draw  6  out  of  49  balls.  Calculating  the  number  of  the  favourable  combinations  and 
all  possible  combinations  leads  to  the  application  of  the  hypergeometric  distribution 
as  follows: 


P(X  =4) 


~  0.001  (or 


We  would  have  obtained  the  same  results  using  the  dhyper  command.  Its  arguments 
are  x ,  M ,  N ,  n,  and  thus,  we  specify 

dhyper (4,6,43,6) 
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Fig. 8.3  The  #(6,43,  6) 
distribution 


x 


The  H(6,  43,  6)  distribution  is  also  visualized  in  Fig.  8.3.  It  is  evident  that  the 
cumulative  probability  of  choosing  2  or  fewer  correct  numbers  is  greater  than  0.9 
(or  90  %),  but  it  is  very  unlikely  to  have  3  or  more  numbers  right.  This  may  explain 
why  the  national  lottery  pays  out  money  only  for  3  or  more  correct  numbers. 


8.2  Standard  Continuous  Distributions 

Now,  we  discuss  some  standard  probability  distributions  of  (absolute)  continuous 
random  variables.  Characteristics  of  continuous  random  variables  are  that  the  number 
of  possible  outcomes  is  uncountably  infinite  and  that  they  have  a  continuous  distribu¬ 
tion  function  F(x).  It  follows  that  the  point  probabilities  are  zero,  i.e.  P(X  =  x)  =  0. 

/JC 

_  f(t)dt. 

\-/\j 


8.2.1  Continuous  Uniform  Distribution 

A  continuous  analogue  to  the  discrete  uniform  distribution  is  the  continuous  uniform 
distribution  on  a  closed  interval  in  R. 


Definition  8.2.1  A  continuous  random  variable  X  is  said  to  follow  a  (continuous) 
uniform  distribution  in  the  interval  [a ,  b] ,  i.e.  X  ~  U  (a ,  b) ,  if  its  probability  density 
function  (PDF)  is  given  by 


/(*)  = 


if  a  <  x  <  b  (a  <  b) 
0  otherwise. 


The  mean  and  variance  of  X  ~  U (a,  b)  are 

a  +  b 


E(X)  = 


and  Var(X)  = 


(b  —  a)j 
12 


respectively. 
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Example  8.2.1  Suppose  a  train  arrives  at  a  subway  station  regularly  every  10  min. 
If  a  passenger  arrives  at  the  station  without  knowing  the  timetable,  then  the  waiting 
time  to  catch  the  train  is  uniformly  distributed  with  density 


fix)  = 


if  0  <  jc  <  10 
0  otherwise. 


The  “average”  waiting  time  is  E(X)  =  (10  +  0)/2  =  5  min.  The  probability  of  wait¬ 
ing  for  the  train  for  less  than  3  min  is  obviously  0.3  (or  30  %)  which  can  be  calculated 
in  R  using  the  punif  (x , a,b)  command  (see  also  Appendix  A.3): 


punif (3,0, 10) 


8.2.2  Normal  Distribution 

The  normal  distribution  is  one  of  the  most  important  distributions  used  in  statistics. 
The  name  was  given  by  Carl  Friedrich  Gauss  (1777-1 855),  a  German  mathematician, 
astronomer,  geodesist,  and  physicist  who  observed  that  measurements  in  geodesy  and 
astronomy  randomly  deviate  in  a  symmetric  way  from  their  true  values.  The  normal 
distribution  is  therefore  also  often  called  a  Gaussian  distribution. 


Definition  8.2.2  A  random  variable  X  is  said  to  follow  a  normal  distribution  with 
parameters  p  and  a2  if  its  PDF  is  given  by 


fix)  = 


(x  -  /i)2\ 
2a2  )’ 


—oo  <  v  <  oo,  —oo  <  p  <  oo,  a 


2 


We  write  X  ~  N(p,  a2).  The  mean  and  variance  of  X  are 

E(X)  =  p;  and  Var(X)  =  a2, 


>  0. 
(8.15) 


respectively.  If  p  =  0  and  a2  =  1,  then  X  is  said  to  follow  a  standard  normal 
distribution,  X  ~  A(0,  1).  The  PDF  of  a  standard  normal  distribution  is  given  by 

v2 

— );  —oo  <  x  <  oo. 

2 


The  density  of  a  normal  distribution  has  its  maximum  (see  Fig.  8.4)  at  x  —  p. 
The  density  is  also  symmetric  around  p.  The  inflexion  points  of  the  density  are  at 
(p  —  a)  and  (p  +  a)  (Fig.  8.4).  A  lower  a  indicates  a  higher  concentration  around 
the  mean  p.  A  higher  a  indicates  a  flatter  density  (Fig.  8.5). 

The  cumulative  distribution  function  of  X  ~  N(p,  a2)  is 


/X 

(p(t)dt  (8.16) 

-OO 


which  is  often  denoted  as  <P(x).  The  value  of  d>(x)  for  various  values  of  v  can  be 
obtained  in  R  following  the  rules  introduced  in  Appendix  A.3.  For  example, 


8.2  Standard  Continuous  Distributions 
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Fig.  8.4  PDF  of  a  normal  distribution* 


Fig.  8.5  PDF  of  N(0,  2),  N(0,  1)  and  N(0,  0.5)  distributions* 


pnorm(1.96,  mean  =0,  sd  =  1) 


calculates  0(1.96)  as  approximately  0.975.  This  means,  for  a  standard  normal  dis¬ 
tribution  the  probability  P(X  <  1.96)  ~  0.975. 

Remark  8.2.1  There  is  no  explicit  formula  to  solve  the  integral  in  Eq.  (8.16).  It  has 
to  be  solved  by  numerical  (or  computational)  methods.  This  is  the  reason  why  CDF 
tables  are  presented  in  almost  all  statistical  textbooks,  see  Table  C.l  in  Appendix  C. 

Example  8.2.2  An  orange  farmer  sells  his  oranges  in  wooden  boxes.  The  weights  of 
the  boxes  vary  and  are  assumed  to  be  normally  distributed  with  p  =  15  kg  and  a2  = 
|  kg2.  The  farmer  wants  to  avoid  customers  being  unsatisfied  because  the  boxes  are 
too  low  in  weight.  He  therefore  asks  the  following  question:  What  is  the  probability 
that  a  box  with  a  weight  of  less  than  13  kg  is  sold?  Using  the  pnorm(x,  /i,  a)  command 
in  R,  we  get 
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pnorm(13 , 15 ,  sqrt  (9/4)  ) 
[1]  0.09121122 


To  calculate  the  probability  in  Example  8.2.2  manually,  we  first  have  to  introduce 
some  theoretical  results. 

Calculation  rules  for  normal  random  variables. 


Let  X  ~  N(/jl,  cr2).  Using  the  transformation 


X-  Li 

Z  =  - E  ~  iV(0,  1), 


(8.17) 


(7 


every  normally  distributed  random  variable  can  be  transformed  into  a  standard  nor¬ 
mal  random  variable.  We  call  this  transformation  the  Z-transformation.  We  can  use 
this  transformation  to  derive  convenient  calculation  rules.  The  probability  for  X  <  b 
is 

X  —  /i  b  —  ii 


p<x  <  j,)  =  p  HL-f:  <  =  P  (z  <  =  *  (t-E\ 


(8.18) 


a  a 

Consequently,  the  probability  for  X  >  a  is 

P(X  >  a)  =  1  -  P(X  <  a)  =  1-0 


(8.19) 


The  probability  that  X  realizes  a  value  in  the  interval  [a,  b ]  is 

.  a  —  a  b  —  Li\  (b  —  Li\  l  a  —  a 

P(a  <  X  <b)  =  P  I  - -  <  Z  <  - -  1=01  - -  I  -  0  1  P 


(: 


a 


a 


(8.20) 

Because  of  the  symmetry  of  the  probability  density  function  <j)(x)  around  its  mean  0, 
the  following  equation  holds  for  the  distribution  function  0(v)  of  a  standard  normal 
random  variable  for  any  value  a : 

0(-a)  =  1  -  0(a).  (8.21) 

It  follows  that  P(— a  <  Z  <  a)  =  2  •  0(a)  —  1,  see  also  Fig.  8.6. 


Example  8.2.3  Recall  Example  8.2.2  where  a  farmer  sold  his  oranges.  He  was  inter¬ 
ested  in  P(X  <  13)  for  X  ~  N(15,  9/4).  Using  (8.17),  we  get 


P(X  <  13)  =  0 


13  -  15 

3 

2 


H) 


=  01 - 1  =  1-0 


(0 


0.091  (or  9.1%) 


To  obtain  0(4/3)  ^  90.9 %,  we  could  either  use  R  (pnorm(4/3))  or  use 
Table  C.l. 
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Fig.  8.6  Distribution  function  of  the  standard  normal  distribution 


Distribution  of  the  Arithmetic  Mean. 


Assume  that  X  ~  N((i,  a2).  Consider  a  random  sample  X  =  (Xi,  X2,  . . . ,  Xn)  of 
independent  and  identically  distributed  random  variables  X/  with  X/  ~  N{ji,  a2). 
Then,  the  arithmetic  mean  X  =  ^  X/=i  X/  follows  a  normal  distribution  with  mean 


and  variance 


E(X)  = 


-Ve  (Xi)  =  /JL 

11  A 
1  =  1 


Var(X)  = 


XVar(v) 

i= 1 


(8.22) 


where  Cov(X/,  X j)  =  0  for  i  /  j .  In  summary,  we  get 


Remark  8.2.2  In  fact,  in  Eq.  (8.22),  we  have  used  the  fact  that  the  sum  of  normal 
random  variables  also  follows  a  normal  distribution,  i.e. 

(Xi  +  X2)  ~  N  (/ii  +  fi2 >  <^1  +  ^2)  • 

This  result  can  be  generalized  to  n  (not  necessarily  identically  distributed  but  inde¬ 
pendent)  normal  random  variables.  In  fact,  it  holds  that  if  Xi,  X2,  . . . ,  Xn  are  inde¬ 
pendent  normal  variables  with  means  fi\,  i±2,  . . . ,  iin  and  variances  a2,  cr2,  . . . ,  a2, 
then  for  any  real  numbers  a\ ,  <22,  •  •  • ,  an,  it  holds  that 

(a\X\  -\-a2X2  H - h  anXn)  ~  N  \a\ii\  +  «2M2  H - anfin,a2a2  +a2a2  H - <2^)  . 
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In  general,  it  cannot  be  taken  for  granted  that  the  sum  of  two  random  variables 
follows  the  same  distribution  as  the  two  variables  themselves.  As  an  example,  con¬ 
sider  the  sum  of  two  independent  uniform  distributions  with  X\  ~  U[ 0,  10]  and 
X2  -  U[ 20,  30].  It  holds  that  E(Xi  +  X2)  =  E(Xi)  +  E(X2)  and  Var(Xi  +  X2)  = 
Var(Xi)  +  Var(X2),  but  X\  +  X2  is  obviously  not  uniformly  distributed. 


8.2.3  Exponential  Distribution 

The  exponential  distribution  is  useful  in  many  situations,  for  example  when  one  is 
interested  in  the  waiting  time,  or  lifetime,  until  an  event  of  interest  occurs.  If  we 
assume  that  the  future  lifetime  is  independent  of  the  lifetime  that  has  already  taken 
place  (i.e.  no  “ageing”  process  is  working),  the  waiting  times  can  be  considered  to 
be  exponentially  distributed. 


Definition  8.2.3  A  random  variable  X  is  said  to  follow  an  exponential  distribution 
with  parameter  A  >  0  if  its  PDF  is  given  by 


fix)  = 


Aexp(— Xx)  if  x>0 
0  otherwise. 


(8.23) 


We  write  X  ~  Exp( A).  The  mean  and  variance  of  an  exponentially  distributed  ran¬ 
dom  variable  X  are 


E(X)  = 


1 

A 


1 

and  Var(X)  =  — r, 

Az 


respectively.  The  CDF  of  the  exponential  distribution  is  given  as 


F(x)  = 


1  —  exp(— Xx)  if  v  >  0 
0  otherwise. 


(8.24) 


Note,  that  P(X  >  x)  =  1  —  F(x )  =  exp(— Xx)  (x  >  0).  An  interesting  property  of 
the  exponential  distribution  is  its  memorylessness:  if  time  t  has  already  been  reached, 
the  probability  of  reaching  a  time  greater  than  t  +  A  does  not  depend  on  t.  This  can 
be  written  as 

P(X  >  t  +  A\X  >t)  =  P(X  >  A)  t,A>  0. 


The  result  can  be  derived  using  basic  probability  rules  as  follows: 

P(X  >  t  +  A  and  X  >  t)  P(X  >  t  +  A) 


P(X  >  t  +  A \X  >  t)  = 


P(X  >  t) 
exp[— A(t  +  A)] 


P(X  >  t) 


exp[— A  t] 

=  1  -  F(A)  =  P(X  >  A) 


—  exp[— XA] 


For  example,  suppose  someone  stands  in  a  supermarket  queue  for  t  minutes.  Say 
the  person  forgot  to  buy  milk,  so  she  leaves  the  queue,  gets  the  milk,  and  stands  in 
the  queue  again.  If  we  use  the  exponential  distribution  to  model  the  waiting  time, 
we  say  that  it  does  not  matter  what  time  it  is:  the  random  variable  “waiting  time 
from  standing  in  the  queue  until  paying  the  bill”  is  not  influenced  by  how  much 
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time  has  elapsed  already;  it  does  not  matter  if  we  queued  before  or  not.  Please  note 
that  the  memorylessness  property  is  shared  by  the  geometric  and  the  exponential 
distributions. 

There  is  also  a  relationship  between  the  Poisson  and  the  exponential  distribution: 

Theorem  8.2.1  The  number  of  events  Y  occurring  within  a  continuum  of  time  is 
Poisson  distributed  with  parameter  A  if  and  only  if  the  time  between  two  events  is 
exponentially  distributed  with  parameter  A. 

The  continuum  of  time  depends  on  the  problem  at  hand.  It  may  be  a  second,  a  minute, 
3  months,  a  year,  or  any  other  time  period. 

Example  8.2.4  Let  Y  be  the  random  variable  which  counts  the  “number  of  accesses 
per  second  for  a  search  engine”.  Assume  that  Y  is  Poisson  distributed  with  parameter 
A  =  10  (E(T)  =  10,  Var(F)  =  10).  The  random  variable  X ,  “waiting  time  until  the 
next  access”,  is  then  exponentially  distributed  with  parameter  A  =  10.  We  therefore 
get 

1  1 
E(X)  =  — ,  Var(X)  =  — r. 
v  '  10  102 

In  this  example,  the  continuum  is  1  s.  The  expected  number  of  accesses  per  second 
is  therefore  E(7)  =  10,  and  the  expected  waiting  time  between  two  accesses  is 
E(X)  =  1/10  s.  The  probability  of  experiencing  a  waiting  time  of  less  than  0.1  s  is 

F(0.1)  =  1  -  exp(-Av)  =  1  -  exp(— 10  •  0.1)  «  0.63. 

In  R,  we  can  obtain  the  same  result  as 


pexp(0 . 1 , 10) 

lj|  Vl 

[1]  0.6321206 

8.3  Sampling  Distributions 

All  the  distributions  introduced  in  this  chapter  up  to  now  are  motivated  by  practical 
applications.  However,  there  are  theoretical  distributions  which  play  an  important 
role  in  the  construction  and  development  of  various  statistical  tools  such  as  those 
introduced  in  Chaps.  9-1 1 .  We  call  these  distributions  “sampling  distributions”.  Now, 
we  discuss  the  x2-,  t-9  and  F -distributions. 

8.3.1  %2-Distribution 

Definition  8.3.1  Let  Z\,  Z2  . . . ,  Zn  be  n  independent  and  identically  N(0,  1)- 
distributed  random  variables.  The  sum  of  their  squares,  X/Li  Z2,  is  then  x2- 
distributed  with  n  degrees  of  freedom  and  is  denoted  as  x2-  The  PDF  of  the  x2- 
distribution  is  given  in  Eq.  (C.7)  in  Appendix  C.3. 
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The  x2 -distribution  is  not  symmetric.  A  x2 -distributed  random  variable  can  only 
realize  values  greater  than  or  equal  to  zero.  Figure  8.7a  shows  the  x2->  X2~>  and 
X5 -distributions.  It  can  be  seen  that  the  “degrees  of  freedom”  specify  the  shape  of 
the  distribution.  Their  interpretation  and  meaning  will  nevertheless  become  clearer 
in  the  following  chapters.  The  quantiles  of  the  CDF  of  different  x2 -distributions  can 
be  obtained  in  R  using  the  qchisq(p,df )  command.  They  are  also  listed  in  Table 
C.3  for  different  values  of  n. 


Theorem  8.3.1  Consider  two  independent  random  variables  which  are  x,2  -  and  x2  - 
distributed,  respectively.  The  sum  of  these  two  random  variables  is  x2+m-  distributed. 


An  important  example  of  a  x2 -distributed  random  variable  is  the  sample  variance 


(Sj)  of  an  i.i.d.  sample  of  size  n  from  a  normally  distributed  population,  i.e. 


(n  -  1)S* 


(7A 


xl-i 


(8.25) 


8.3.2  ^-Distribution 

Definition  8.3.2  Let  X  and  Y  be  two  independent  random  variables  where  X  ~ 
N( 0,  1)  and  Y  ~  x2-  The  ratio 

X 

—  ^  in 
n 

follows  a  t -distribution  (Student’s  r -distribution)  with  n  degrees  of  freedom.  The 
PDF  of  the  t -distribution  is  given  in  Eq.  (C.8)  in  Appendix  C.3. 

Figure  8.7b  visualizes  the  t\~,  t$~,  and  ^30 -distributions.  The  quantiles  of  different 
t -distributions  can  be  obtained  in  R  using  the  qt  (p,df )  command.  They  are  also 
listed  in  Table  C.2  for  different  values  of  n. 

An  application  of  the  £ -distribution  is  the  following:  if  we  draw  a  sample  of  size 
n  from  a  normal  population  N(p,  a2)  and  calculate  the  arithmetic  mean  X  and  the 
sample  variance  then  the  following  theorem  holds: 


Theorem  8.3.2  (Student’s  theorem)  Let  X  =  (Xi,  X2, . . . ,  Xn)  with  X/  N 
(/i,  a2).  The  ratio 


^  in  —  1 


(8.26) 


is  then  t -distributed  with  n  —  1  degrees  of  freedom. 
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f(x)  f(x) 


(a)  x2  distributions  (b)  t  distributions 

Fig.  8.7  Probability  density  functions  of  x2  and  t  distributions* 


8.3.3  F-Distribution 


Definition  8.3.3  Let  X  and  Y  be  independent  xh  and  Xn -distributed  random  vari¬ 
ables,  then  the  distribution  of  the  ratio 


X/m 

Y/n 


F, 


m ,  n 


(8.27) 


follows  the  Fisher  F-distribution  with  ( m,n )  degrees  of  freedom.  The  PDF  of  the 
F -distribution  is  given  in  Eq.  (C.9)  in  Appendix  C.3. 


If  X  is  a  Xrdistributed  random  variable,  then  the  ratio  (8.27)  is  F\,n -distributed. 
The  square  root  of  this  ratio  is  tn -distributed  since  the  square  root  of  a  xp -distributed 
random  variable  is  Af(0,  1) -distributed.  If  W  is  F-distributed,  FWjW,  then  l/W  is 
FWjW -distributed.  Figure  8.8  visualizes  the  F5  5,  F540  and  F530  distributions.  The 


Fig.  8.8  Probability  density 
functions  for  different 
F -distributions* 


/0) 
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quantiles  of  different  /^-distributions  can  be  obtained  in  R  using  the  qf  (p ,  df  1 ,  df  2 ) 
command. 


One  application  of  the  F  -distribution  relates  to  the  ratio  of  two  sample  variances  of 


two  independent  samples  of  size  m  and  n ,  where  each  sample  is  an  i.i.d.  sample  from 
a  normal  population,  i.e.  N(/ix,  cr2)  and  N(/j,y,  cr2).  For  the  sample  variances  S\  = 

jFi  2”li(V  -  ^)2  and  sy  =  A  H"=i  (Yi  ~  Y)2  from  the  two  populations,  the 
ratio 


is  F -distributed  with  (m  —  1)  degrees  of  freedom  in  the  numerator  and  (n  —  1) 
degrees  of  freedom  in  the  denominator. 


8.4  Key  Points  and  Further  Issues 

Note: 

/  Examples  of  different  distributions  are: 

Distribution  Example 

Uniform  Rolling  a  die  (discrete) 

Waiting  for  a  train  (continuous) 

Bernoulli  Any  binary  variable  such  as  gender 

Binomial  Number  of  “heads”  when  tossing  a  coin  n  times 

Poisson  Number  of  particles  emitted  by  a  radioactive  source 

entering  a  small  area  in  a  given  time  interval 
Multinomial  Categorical  variables  such  as  “party  voted  for” 
Geometric  Number  of  raffle  tickets  until  first  ticket  wins 
Hypergeometric  National  lotteries;  Fisher’s  test,  see  p.  428 
Normal  Height  or  weight  of  women  (men) 

Exponential  Survival  time  of  a  PC 

X2  Sample  variance;  x2  tests,  see  p.  235  ff 

t  Confidence  interval  for  the  mean,  see  p.  197 

F  Tests  in  the  linear  model,  see  p.  272 
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Note: 


/  One  can  use  R  to  determine  values  of  densities  (PDF/PMF),  cumula¬ 
tive  probability  distribution  functions  (CDF),  quantiles  of  the  CDF,  and 
random  numbers: 


First  letter  Function 

Further  letters 

Example 

d 

Density 

distribution 

name  dnorm 

P 

Probability 

distribution 

name  pnorm 

q 

Quantiles 

distribution 

name  qnorm 

r 

Random  number  distribution 

name  rnorm 

We  encourage  the  use  of  R  to  obtain  quantiles  of  sampling  distributions, 
but  Tables  C.1-C.3  also  list  some  of  them. 

/  In  this  chapter,  we  assumed  the  parameters  such  as  fi,  cr,  A,  and  others  to 
be  known.  In  Chap.  9,  we  will  propose  how  to  estimate  these  parameters 
from  the  data.  In  Chap.  10,  we  test  statistical  hypotheses  about  these 
parameters. 

/  For  n  i.i.d.  random  variables  X\,  X2,  . . . ,  Xn,  the  arithmetic  mean  X 
converges  to  a  N(fi,  cr2 /n)  distribution  as  n  tends  to  infinity.  See  Appen¬ 
dix  C.3  as  well  as  Exercise  8.11  for  the  Theorem  of  Large  Numbers  and 
the  Central  Limit  Theorem,  respectively. 


8.5  Exercises 

Exercise  8.1  A  company  producing  cereals  offers  a  toy  in  every  sixth  cereal  package 
in  celebration  of  their  50th  anniversary.  A  father  immediately  buys  20  packages. 

(a)  What  is  the  probability  of  finding  4  toys  in  the  20  packages? 

(b)  What  is  the  probability  of  finding  no  toy  at  all? 

(c)  The  packages  contain  three  toys.  What  is  the  probability  that  among  the  5  pack¬ 
ages  that  are  given  to  the  family’s  youngest  daughter,  she  finds  two  toys? 

Exercise  8.2  A  study  on  breeding  birds  collects  information  such  as  the  length  of 
their  eggs  (in  mm).  Assume  that  the  length  is  normally  distributed  with  /i  =  42. 1  mm 
and  a2  =  20. 82.  What  is  the  probability  of 

(a)  finding  an  egg  with  a  length  greater  than  50  mm? 

(b)  finding  an  egg  between  30  and  40  mm  in  length? 

Calculate  the  results  both  manually  and  by  using  R. 
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Exercise  8.3  A  dodecahedron  is  a  die  with  12  sides.  Suppose  the  numbers  on  the  die 
are  1-12.  Consider  the  random  variable  X  which  describes  which  number  is  shown 
after  rolling  the  die  once.  What  is  the  distribution  of  XI  Determine  E(X)  and  Var(X). 

Exercise  8.4  Felix  states  that  he  is  able  to  distinguish  a  freshly  ground  coffee  blend 
from  an  ordinary  supermarket  coffee.  One  of  his  friends  asks  him  to  taste  10  cups 
of  coffee  and  tell  him  which  coffee  he  has  tasted.  Suppose  that  Felix  has  actually  no 
clue  about  coffee  and  simply  guesses  the  brand.  What  is  the  probability  of  at  least  8 
correct  guesses? 

Exercise  8.5  An  advertising  board  is  illuminated  by  several  hundred  bulbs.  Some 
of  the  bulbs  are  fused  or  smashed  regularly.  If  there  are  more  than  5  fused  bulbs  on 
a  day,  the  owner  of  the  board  replaces  them,  otherwise  not.  Consider  the  following 
data  collected  over  a  month  which  captures  the  number  of  days  (A/)  on  which  i  bulbs 
were  broken: 


Fused  bulbs 

0 

1 

2 

3 

4 

5 

ni 

6 

8 

8 

5 

2 

1 

(a)  Suggest  an  appropriate  distribution  for  X:  “number  of  broken  bulbs  per  day”. 

(b)  What  is  the  average  number  of  broken  bulbs  per  day?  What  is  the  variance? 

(c)  Determine  the  probabilities  P(X  =  x)  using  the  distribution  you  chose  in  (a) 
and  using  the  average  number  of  broken  bulbs  you  calculated  in  (b).  Compare 
the  probabilities  with  the  proportions  obtained  from  the  data. 

(d)  Calculate  the  probability  that  at  least  6  bulbs  are  fused,  which  means  they  need 
to  be  replaced. 

(e)  Consider  the  random  variable  Y :  “time  until  next  bulb  breaks”.  What  is  the 
distribution  of  Y  ? 

(f)  Calculate  and  interpret  E(T). 

Exercise  8.6  Marco’s  company  organizes  a  raffle  at  an  end-of-year  function.  There 
are  4000  raffle  tickets  to  be  sold,  of  which  500  win  a  prize.  The  price  of  each  ticket 
is  €1.50.  The  value  of  the  prizes,  which  are  mostly  electrical  appliances  produced 
by  the  company,  varies  between  €80  and  €250,  with  an  average  value  of  €142. 

(a)  Marco  wants  to  have  a  99  %  guarantee  of  receiving  three  prizes.  How  much 
money  does  he  need  to  spend?  Use  R  to  solve  the  question. 

(b)  Use  R  to  plot  the  function  which  describes  the  relationship  between  the  number 
of  tickets  bought  and  the  probability  of  winning  at  least  three  prizes. 

(c)  Given  the  value  of  the  prizes  and  the  costs  of  the  tickets,  is  it  worth  taking  part 
in  the  raffle? 
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Exercise  8.7  A  country  has  a  ratio  between  male  and  female  births  of  1 .05  which 
means  that  51.22  %  of  babies  born  are  male. 

(a)  What  is  the  probability  for  a  mother  that  the  first  girl  is  born  during  the  first  three 
births? 

(b)  What  is  the  probability  of  getting  2  girls  among  4  babies? 

Exercise  8.8  A  fishermen  catches,  on  average,  three  fish  in  an  hour.  Let  Y  be  a 
random  variable  denoting  the  number  of  fish  caught  in  one  hour  and  let  X  be  the 
time  interval  between  catching  two  fishes.  We  assume  that  X  follows  an  exponential 
distribution. 

(a)  What  is  the  distribution  of  Y  ? 

(b)  Determine  E(Y)  and  E(A). 

(c)  Calculate  P(Y  =  5)  and  P(Y  <  1). 

Exercise  8.9  A  restaurant  sells  three  different  types  of  dessert:  chocolate,  brownies, 
yogurt  with  seasonal  fruits,  and  lemon  tart.  Years  of  experience  have  shown  that  the 
probabilities  with  which  the  desserts  are  chosen  are  0.2,  0.3,  and  0.5,  respectively. 

(a)  What  is  the  probability  that  out  of  5  guests,  2  guests  choose  brownies,  1  guest 
chooses  yogurt,  and  the  remaining  2  guests  choose  lemon  tart? 

(b)  Suppose  two  out  of  the  five  guests  are  known  to  always  choose  lemon  tart.  What 
is  the  probability  of  the  others  choosing  lemon  tart  as  well? 

(c)  Determine  the  expectation  and  variance  assuming  a  group  of  20  guests. 

Exercise  8.10  A  reinsurance  company  works  on  a  premium  policy  for  natural  disas¬ 
ters.  Based  on  experience,  it  is  known  that  W  =  “number  of  natural  disasters  from 
October  to  March”  (winter)  is  Poisson  distributed  with  A^  =  4.  Similarly,  the  ran¬ 
dom  variable  S  =  “number  of  natural  disasters  from  April  to  September”  (summer) 
is  Poisson  distributed  with  A^  =  3.  Determine  the  probability  that  there  is  at  least  1 
disaster  during  both  summer  and  winter  based  on  the  assumption  that  the  two  random 
variables  are  independent. 


Exercise  8.11  Read  Appendix  C.3  to  learn  about  the  Theorem  of  Large  Numbers 
and  the  Central  Limit  Theorem. 

(a)  Draw  1 000  realizations  from  a  standard  normal  distribution  using  R  and  calculate 
the  arithmetic  mean.  Repeat  this  process  1000  times.  Evaluate  the  distribution 
of  the  arithmetic  mean  by  drawing  a  kernel  density  plot  and  by  calculating  the 
mean  and  variance  of  it. 
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(b)  Repeat  the  procedure  in  (a)  with  an  exponential  distribution  with  A  =  1 .  Interpret 
your  findings  in  the  light  of  the  Central  Limit  Theorem. 

(c)  Repeat  the  procedure  in  (b)  using  10,000  rather  than  1000  realizations.  How  do 
the  results  change  and  why? 


Solutions  to  all  exercises  in  this  chapter  can  be  found  on  p.  375 


"Source  Toutenburg,  H.,  Heumann,  C.,  Induktive  Statistik ,  4th  edition,  2007, 
Springer,  Heidelberg 
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9.1  Introduction 

The  first  four  chapters  of  this  book  illustrated  how  one  can  summarize  a  data  set 
both  numerically  and  graphically.  The  validity  of  interpretations  made  from  such 
a  descriptive  analysis  is  valid  only  for  the  data  set  under  consideration  and  cannot 
necessarily  be  generalized  to  other  data.  However,  it  is  desirable  to  make  conclusions 
about  the  entire  population  of  interest  and  not  only  about  the  sample  data.  In  this 
chapter,  we  describe  the  framework  of  statistical  inference  which  allows  us  to 
infer  from  the  sample  data  about  the  population  of  interest-at  a  given,  prespecified 
uncertainty  level-and  knowledge  about  the  random  process  generating  the  data. 

Consider  an  example  where  the  objective  is  to  forecast  an  election  outcome. 
This  requires  us  to  determine  the  proportion  of  votes  that  each  of  the  k  participating 
parties  is  going  to  receive,  i.e.  to  calculate  or  estimate  p\ ,  p2, . . . ,  Pk-  If  it  is  possible 
to  ask  every  voter  about  their  party  preference,  then  one  can  simply  calculate  the 
proportions  p\,  p2, . . . ,  Pk  for  each  party.  However,  it  is  logistically  impossible  to 
ask  all  eligible  voters  (which  form  the  population  in  this  case)  about  their  preferred 
party.  It  seems  more  realistic  to  ask  only  a  small  fraction  of  voters  and  infer  from 
their  responses  to  the  responses  of  the  whole  population.  It  is  evident  that  there  might 
be  differences  in  responses  between  the  sample  and  the  population — but  the  more 
voters  are  asked,  the  closer  we  are  to  the  population’s  preference,  i.e.  the  higher 
the  precision  of  our  estimates  for  p\,  p2,  . . . ,  Pk  (the  meaning  of  “precision”  will 
become  clearer  later  in  this  chapter).  Also,  it  is  intuitively  clear  that  the  sample  must 
be  a  representative  sample  of  the  voters’  population  to  avoid  any  discrepancy  or  bias 
in  the  forecasting.  When  we  speak  of  a  representative  sample,  we  mean  that  all  the 
characteristics  present  in  the  population  are  contained  in  the  sample  too.  There  are 
many  ways  to  get  representative  random  samples.  In  fact,  there  is  a  branch  of  statistics, 
called  sampling  theory,  which  studies  this  subject  [see,  e.g.  Groves  et  al.  (2009)  or 
Kauermann  and  Kiichenhoff  (2011)  for  more  details].  A  simple  random  sample  is 
one  where  each  voter  has  an  equal  probability  of  being  selected  in  the  sample  and 
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each  voter  is  independently  chosen  from  the  same  population.  In  the  following,  we 
will  assume  that  all  samples  are  simple  random  samples.  To  further  formalize  the 
election  forecast  problem,  assume  that  we  are  interested  in  the  true  proportions  which 
each  party  receives  on  the  election  day.  It  is  practically  impossible  to  make  a  perfect 
prediction  of  these  proportions  because  there  are  too  many  voters  to  interview,  and 
moreover,  a  voter  may  possibly  make  their  final  decision  possibly  only  when  casting 
the  vote  and  not  before.  The  voter  may  change  his/her  opinion  at  any  moment  and  may 
differ  from  what  he/she  claimed  earlier.  In  statistics,  we  call  these  true  proportions 
parameters  of  the  population.  The  task  is  then  to  estimate  these  parameters  on  the 
basis  of  a  sample.  In  the  election  example,  the  intuitive  estimates  for  the  proportions 
in  the  population  are  the  proportions  in  the  sample  and  we  call  them  sample  estimates. 
How  to  find  good  and  precise  estimates  are  some  of  the  challenges  that  are  addressed 
by  the  concept  of  statistical  inference.  Now,  it  is  possible  to  describe  the  election 
forecast  problem  in  a  statistical  and  operational  framework:  estimate  the  parameters 
of  a  population  by  calculating  the  sample  estimates.  An  important  property  of  every 
good  statistical  inference  procedure  is  that  it  provides  not  only  estimates  for  the 
population  parameters  but  also  information  about  the  precision  of  these  estimates. 

Consider  another  example  in  which  we  would  like  to  study  the  distribution  of 
weight  of  children  in  different  age  categories  and  get  an  understanding  of  the  “nor¬ 
mal”  weight.  Again,  it  is  not  possible  to  measure  the  weight  of  all  the  children  of 
a  specific  age  in  the  entire  population  of  children  in  a  particular  country.  Instead, 
we  draw  a  random  sample  and  use  methods  of  statistical  inference  to  estimate  the 
weight  of  children  in  each  age  group.  More  specifically,  we  have  several  populations 
in  this  problem.  We  could  consider  all  boys  of  a  specific  age  and  all  girls  of  a  spe¬ 
cific  age  as  two  different  populations.  For  example,  all  3 -year-old  boys  will  form 
one  possible  population.  Then,  a  random  sample  is  drawn  from  this  population.  It  is 
reasonable  to  assume  that  the  distribution  of  the  weight  of  k-year-old  boys  follows  a 
normal  distribution  with  some  unknown  parameters  pkb  and  cr%b.  Similarly,  another 
population  of  k-year-old  girls  is  assumed  to  follow  a  normal  distribution  with  some 
unknown  parameters  pkg  and  cr|  .  The  indices  kb  and  kg  are  used  to  emphasize 
that  the  parameters  may  vary  by  age  and  gender.  The  task  is  now  to  calculate  the 
estimates  of  the  unknown  parameters  (in  the  population)  of  the  normal  distributions 
from  the  samples.  Using  quantiles,  a  range  of  “normal”  weights  can  then  be  specified, 
e.g.  the  interval  from  the  1  %  quantile  to  the  99  %  quantile  of  the  estimated  normal 
distribution  or,  alternatively,  all  weights  which  are  not  more  than  twice  the  standard 
deviation  away  from  the  mean.  Children  with  weights  outside  this  interval  may  be 
categorized  as  underweight  or  overweight.  Note  that  we  make  a  specific  assumption 
for  the  distribution  class;  i.e.  we  assume  a  normal  distribution  for  the  weights  and 
estimate  its  parameters.  We  call  this  a  parametric  estimation  problem  because  it  is 
based  on  distributional  assumptions.  Otherwise,  if  no  distributional  assumptions  are 
made,  we  speak  of  a  nonparametric  estimation  problem. 
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9.2  Properties  of  Point  Estimators 

As  we  discussed  in  the  introduction,  the  primary  goal  in  statistical  inference  is  to  find 
a  good  estimate  of  (a)  population  parameter(s).  The  parameters  are  associated  with 
the  probability  distribution  which  is  believed  to  characterize  the  population;  e.g.  fi 
and  a2  are  the  parameters  in  a  normal  distribution  N  (/a,  a2).  If  these  parameters  are 
known,  then  one  can  characterize  the  entire  population.  In  practice,  these  parameters 
are  unknown,  so  the  objective  is  to  estimate  them.  One  can  attempt  to  obtain  them 
based  on  a  function  of  the  sample  values.  But  what  does  this  function  look  like;  and 
if  there  is  more  than  one  such  function,  then  which  is  the  best  one?  What  is  the  best 
approach  to  estimate  the  population  parameters  on  the  basis  of  a  given  sample  of 
data?  The  answer  is  given  by  various  statistical  concepts  such  as  bias,  variability, 
consistency,  efficiency,  sufficiency,  and  completeness  of  the  estimates.  We  are  going 
to  introduce  them  now. 

Assume  v  =  (x\ ,  X2  . . . ,  xn)  are  the  observations  of  a  random  sample  from  a  pop¬ 
ulation  of  interest.  The  random  sample  represents  the  realized  values  of  a  random 
variable  X.  It  can  be  said  that  x\ ,  X2  . . . ,  xn  are  the  n  observations  collected  on  the 
random  variable  X.  Any  function  of  random  variables  is  called  a  statistic.  For  exam¬ 
ple,  X  =  Xi ,  max(Xi,  X2,  ... ,  Xn )  etc.  are  functions  of  X\,  X2, . . . ,  Xn, 

so  they  are  a  statistic.  It  follows  that  a  statistic  is  also  a  random  variable.  Consider 
a  statistic  T  (X)  which  is  used  to  estimate  a  population  parameter  0  (which  may  be 
either  a  scalar  or  a  vector).  We  say  T (X)  is  an  estimator  of  0.  To  indicate  that  we 

/V 

estimate  0  using  T  (X),  we  use  the  “hat”  (  "  )  symbol,  i.e.  we  write  0  =  T  (X).  When 
T  is  calculated  from  the  sample  values  x\,  X2  . . . ,  xn,  we  write  T(x)  and  call  it  an 
estimate  of  6.  It  becomes  clear  that  T  (X)  is  a  random  variable  but  T(x)  is  its  observed 
value  (dependent  on  the  actual  sample).  For  example,  T (X)  =  X  =  ^  YU  Xi  is 

an  estimator  and  a  statistic,  but  T(x)  =  x  =  ^  X/=i  xi  is  its  estimated  value  from 
the  realized  sample  values  x\,  X2,  •  •  • ,  xn.  Since  the  sample  values  are  realizations 
from  a  random  variable,  each  sample  leads  to  a  different  value  of  the  estimate  of 
the  population  parameter.  The  population  parameter  is  assumed  to  be  a  fixed  value. 
Parameters  can  also  be  assumed  to  be  random,  for  example  in  Bayesian  statistics, 
but  this  is  beyond  the  scope  of  this  book. 


9.2.1  Unbiasedness  and  Efficiency 

Definition  9.2.1  An  estimator  T(X)  is  called  an  unbiased  estimator  of  0  if 

E  e(T(X))  =  6.  (9.1) 

The  index  6  denotes  that  the  expectation  is  calculated  with  respect  to  the  distribution 
whose  parameter  is  0. 
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The  bias  of  an  estimator  T  ( X )  is  defined  as 

Bmse(T(X))=Ee(T(X))-6 .  (9.2) 

It  follows  that  an  estimator  is  said  to  be  unbiased  if  its  bias  is  zero. 

Definition  9.2.2  The  variance  of  T  (X)  is  defined  as 

Var e(T(X))  =  E  {[7(X)  -  E(7"(X))]2}  .  (9.3) 

Both  bias  and  variance  are  measures  which  characterize  the  properties  of  an  esti¬ 
mator.  In  statistical  theory,  we  search  for  “good”  estimators  in  the  sense  that  the  bias 
and  the  variance  are  as  small  as  possible  and  therefore  the  accuracy  is  as  high  as 
possible.  Readers  interested  in  a  practical  example  may  consult  Examples  9.2.1  and 
9.2.2,  or  the  explanations  for  Fig.  9.1. 

It  turns  out  that  we  cannot  minimize  both  measures  simultaneously  as  there  is 
always  a  so-called  bias- variance  tradeoff.  A  measure  which  combines  bias  and  vari¬ 
ance  into  one  measure  is  the  mean  squared  error. 

Definition  9.2.3  The  mean  squared  error  (MSE)  of  T  (X)  is  defined  as 

MSEg(T(X))  =  E{[T(X)  -9]2}  .  (9.4) 

The  expression  (9.4)  can  be  partitioned  into  two  parts:  the  variance  and  the  squared 
bias,  i.e. 

MSEtf(T(X))  =  Var0(r(X))  +  [Biase(T(X))]2  .  (9.5) 

This  can  be  proven  as  follows: 

MSE0(r(X))  =  E[T(X)  -  6>]2 

=  E[(T(X)  -  E g(T(X))  +  (E g(T{X)  -  9)}2 
=  E[7’(X)  -  Eg(T (X))]2  +  [E e(T(X))  -  9]2 
=  Var g(T(X))  +  [Bias0(r(X))]2  . 

Note  that  the  calculation  is  based  on  the  result  that  the  cross  product  term  is  zero. 
The  mean  squared  error  can  be  used  to  compare  different  biased  estimators. 

Definition  9.2.4  An  estimator  T\  (X)  is  said  to  be  MSE-better  than  another  estimator 
72  (X)  for  estimating  6  if 

MSE*(7i(X))  <  MSE*(r2(X))  , 

where  0  e  0  and  &  is  the  parameter  space,  i.e.  the  set  of  all  possible  values  of  0. 
Often,  0  is  R  or  all  positive  real  values  R+.  For  example,  for  a  normal  distribution, 
N(fi,  a2),  n  can  be  any  real  value  and  a2  has  to  be  a  number  greater  than  zero. 


9.2  Properties  of  Point  Estimators 


185 


Unfortunately,  we  cannot  find  an  MSE-optimal  estimator  in  the  sense  that  an 
estimator  is  MSE-better  than  all  other  possible  estimators  for  all  possible  values  of 
6.  This  becomes  clear  if  we  define  the  constant  estimator  T(x)  =  c  (independent 
of  the  actual  sample):  if  9  =  c,  i.e.  if  the  constant  value  equals  the  true  population 
parameter  we  want  to  estimate,  then  the  MSE  of  this  constant  estimator  is  zero 
(but  it  will  be  greater  than  zero  for  all  other  values  of  6,  and  the  bias  increases 
more  as  we  move  c  far  away  from  the  true  0).  Usually,  we  can  only  find  estimators 
which  are  locally  best  (in  a  certain  subset  of  (9).  This  is  why  classical  statistical 
inference  restricts  the  search  for  best  estimators  to  the  class  of  unbiased  estimators. 
For  unbiased  estimators,  the  MSE  is  equal  to  the  variance  of  an  estimator.  In  this 
context,  the  following  definition  is  used  for  comparing  two  (unbiased)  estimators. 


Definition  9.2.5  An  unbiased  estimator  T\(X)  is  said  to  be  more  efficient  than 
another  unbiased  estimator  T2  (X)  for  estimating  6  if 

Var0(7i(X))  <  Var e(T2(X))  ,  VO  e  0  , 


and 


Var0(7i(X))  <  Var0(r2(X)) 


for  at  least  one  0  e  0 .  It  turns  out  that  restricting  our  search  of  best  estimators  to 
unbiased  estimators  is  sometimes  a  successful  strategy;  i.e.  for  many  problems,  a 
best  or  most  efficient  estimate  can  be  found.  If  such  an  estimator  exists,  it  is  said  to 
be  UMVU  (uniformly  minimum  variance  unbiased).  Uniformly  means  that  it  has  the 
lowest  variance  among  all  other  unbiased  estimators  for  estimating  the  population 
parameter(s)  0. 


Consider  the  illustration  in  Fig.  9.1  to  better  understand  the  introduced  concepts. 
Suppose  we  throw  three  darts  at  a  target  and  the  goal  is  to  hit  the  centre  of  the  target, 
i.e.  the  innermost  circle  of  the  dart  board.  The  centre  represents  the  population 

/K  /K  /V 

parameter  0.  The  three  darts  play  the  role  of  three  estimates  9\,  02,03  (based  on 
different  realizations  of  the  sample)  of  the  population  parameter  6.  Four  possible 
situations  are  illustrated  in  Fig.  9.1.  For  example,  in  Fig.  9.1b,  we  illustrate  the  case 
of  an  estimator  which  is  biased  but  has  low  variance:  all  three  darts  are  “far”  away 
from  the  centre  of  the  target,  but  they  are  “close”  together.  If  we  look  at  Fig.  9.1a,  c, 
we  see  that  all  three  darts  are  symmetrically  grouped  around  the  centre  of  the  target, 
meaning  that  there  is  no  bias;  however,  in  Fig.  9.1a  there  is  much  higher  precision 
than  in  Fig.  9.1c.  It  is  obvious  that  Fig.  9.1a  presents  an  ideal  situation:  an  estimator 
which  is  unbiased  and  has  minimum  variance. 


Theorem  9.2.1  Let  X  =  (X  \ ,  X2  . . . ,  Xn)  be  an  i.i.d.  ( random )  sample  of  a  random 
variable  X  with  population  mean  E(X()  =  p  and  population  variance  Var(Xi)  = 
a2,  for  all  i  =  1,  2,  . . . ,  n.  Then  the  arithmetic  mean  X  =  X/=i  *s  an  unbiased 
estimator  of  p  and  the  sample  variance  S 2  =  —  ^)2  ^  an  unbiased 

estimator  of  cr2. 


186 


9  Inference 


(a)  No  bias,  low  variance 


(b)  Biased,  low  variance 


(c)  No  bias,  high  variance 

Fig.  9.1  Illustration  of  bias  and  variance 


(d)  Biased,  high  variance 


Note  that  the  theorem  holds,  in  general,  for  i.i.d.  samples,  irrespective  of  the  choice 
of  the  distribution  of  the  X/  ’s.  Note  again  that  we  are  looking  at  the  situation  before 
we  have  any  observations  on  X.  Therefore,  we  again  use  capital  letters  to  denote 
that  the  X*  ’s  are  random  variables  which  are  not  known  beforehand  (i.e.  before  we 
actually  record  the  observations  on  our  selected  sampling  units). 

Remark  9.2.1  The  empirical  variance  S2  =  ^  X/?=i  (X/  —  X)2  is  a  biased  estimate 
of  a2  and  its  bias  is  —  -a2. 

n 


Example  9.2.1  Let  X\ ,  X2  . . . ,  Xn  be  identically  and  independently  distributed  vari¬ 
ables  whose  population  mean  is  p  and  population  variance  is  a2.  Then  X  = 
^  X/=i  Xi  is  an  unbiased  estimator  of  p.  This  can  be  shown  as  follows: 


E(X)  =  E 


(7.29)  1 
n 


XE(V) 

i= 1 
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The  variance  of  X  can  be  calculated  as  follows: 


1  n 

Var(X)  =  — ^  ^  Var(X;)  ,  [Cov(X/,  X j)  =0  using  independence  of  X;’s  ] 

Yl 

i  —  \ 


i  n  2 

=  il”1'- 

ft 

/  —  I 

2 

We  conclude  that  X  is  an  unbiased  estimator  of  p  and  its  variance  is  irrespective 

of  the  choice  of  the  distribution  of  X.  We  have  learned  about  the  distribution  of  X 
already  in  Chap.  8,  see  also  Appendix  C.3  for  the  Theorem  of  Large  Numbers  and 
the  Central  Limit  Theorem;  however,  we  would  like  to  highlight  the  property  of 
“unbiasedness”  in  the  current  context. 


Now,  we  consider  another  example  to  illustrate  that  estimators  may  not  always  be 
unbiased  but  may  have  the  same  variance. 


Example  9.2.2  Let  X\ ,  X2  . . . ,  Xn  be  identically  and  independently  distributed  vari¬ 
ables  whose  population  mean  is  p  and  population  variance  is  a2 .  Then  X  =  X  +  1  = 
^  X/7=i  (X;  +  1)  is  a  biased  estimator  of  p.  This  can  be  shown  as  follows: 


E(X) 


(7229)  1 
n 


i= 1 


1 

n 


n 


yy + 1 


—  p  +  1 7^  p  • 


However,  the  variance  of  X  is 

2 

Var(X)  =  Var  (X  +  l)  (7=4)  Var(X)  =  —  . 

n 

If  we  compare  the  two  estimators  X  =  ^  +  1)  and  X  =  ^  X/?=i  (X/),  we 

see  that  both  have  the  same  variance  but  the  former  (X)  is  biased.  The  efficiency 
of  both  estimators  is  thus  the  same.  It  further  follows  that  the  mean  squared  error 
of  X  is  smaller  than  the  mean  squared  error  of  X  because  the  MSE  consists  of  the 
sum  of  the  variance  and  the  squared  bias.  Therefore  X  is  MSE-better  than  X.  The 
comparison  of  bias,  variance  and  MSE  tells  us  that  we  should  prefer  X  over  X  when 
estimating  the  population  mean.  This  is  intuitive,  but  the  argument  we  make  is  a 
purely  statistical  one. 


Theorem  9.2.1  contains  the  following  special  cases: 

•  The  sample  mean  X  =  ^  based  on  an  i.i.d.  random  sample  X\ ,  X2, 

Xn  from  a  normally  distributed  population  N(p,  a2)  is  an  unbiased  point  estimator 
of  p. 
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•  The  sample  variance  S 2  =  X/=i  (%i  ~  -*02  based  on  an  i.i.d.  random  sample 

X\,  X2,  •  •  • ,  Xn  from  a  normally  distributed  population  N(p,  a2)  is  an  unbiased 
point  estimator  of  a2.  The  sample  variance  S2  =  ^  —  X)2  is  a  biased 

estimator  for  a2,  but  it  is  asymptotically  unbiased  in  the  sense  that  its  bias  tends 
to  zero  as  the  sample  size  n  tends  to  infinity. 

•  The  sample  mean  X  =  ^  based  on  an  i.i.d.  random  sample  X\ ,  X2,  . . . , 

Xn  from  a  Bernoulli  distributed  population  B(  1,  p)  is  an  unbiased  point  estimator 
of  the  probability  p. 


For  illustration,  we  show  the  validity  of  the  third  statement.  Let  us  consider  an  i.i.d. 
random  sample  X/,  i  =  1,2 ...  ,n,  from  a  Bernoulli  distribution,  where  X/  =  1  if 
an  event  occurs  and  X*  =  0  otherwise.  Here,  p  is  the  probability  of  occurrence  of 
an  event  in  the  population,  i.e.  p  =  P(Xj  =  1).  Note  that  p  is  also  the  population 
mean:  E(X/)  =  1  •  p  +  0  •  (1  —  p)  =  p,i  =  1,2 ,  ...  ,n.  The  arithmetic  mean  (rel¬ 
ative  frequency)  is  an  unbiased  estimator  of  p  because 


E(X)  = 


n 

11  i= 1 


and  thus,  we  can  write  the  estimate  of  p  as 


(9.6) 


Example  9.2.3  Suppose  a  random  sample  of  size  n  =  20  of  the  weight  of  10-year- 
old  children  in  a  particular  city  is  drawn.  Let  us  assume  that  the  children’s  weight 
in  the  population  follows  a  normal  distribution  N(fi,  a2).  The  sample  provides  the 
following  values  of  weights  (in  kg): 

40.2,  32.8,  38.2,  43.5,  47.6,  36.6,  38.4,  45.5,  44.4,  40.3 
34.6,  55.6,  50.9,  38.9,  37.8,  46.8,  43.6,  39.5,  49.9,  34.2 


To  obtain  an  estimate  of  the  population  mean  p,  we  calculate  the  arithmetic  mean  of 
the  observations  as 

1  n  1 

fi  =  x  =  -  2_,Xi  =  —(40.2  +  32.8  +  •  •  •  +  34.2)  =  41.97, 

/  —  I 

because  it  is  an  unbiased  estimator  of  p.  Similarly,  we  use  S 2  to  estimate  a2  because 
it  is  unbiased  in  comparison  to  S 2 .  Using  s\  as  an  estimate  for  a2  for  the  given 
observations,  we  get 


1 


y^,(xj  -  x)2 

i=i 


i  i 

=  —((40.2  -  41. 97)2  +  •  •  •  +  (34.2 


41.97)2)  36.85. 
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The  square  root  of  36.85  is  approximately  6.07  which  is  the  standard  deviation.  Note 
that  the  standard  deviation  based  on  the  sample  values  divided  by  the  square  root  of 
the  sample  size,  i.e.  a/  V20,  is  called  the  standard  error  of  the  mean  X  (SEM).  As 
already  introduced  in  Chap.  3,  we  obtain  these  results  in  R  using  the  mean  and  var 
commands. 


Example  9.2.4  A  library  draws  a  random  sample  of  size  n  =  100  members  from  the 
members’  database  to  see  how  many  members  have  to  pay  a  penalty  for  returning 
books  late,  i.e.  v;  =  1.  It  turns  out  that  39  members  in  the  sample  have  to  pay  a  penalty. 
Therefore,  an  unbiased  estimator  of  the  population  proportion  of  all  members  of  the 
library  who  return  books  late  is 


39 

Too 


0.39. 


Remark  9.2.2  Unbiasedness  and  efficiency  can  also  be  defined  asymptotically:  we 
say,  for  example,  that  an  estimator  is  asymptotically  unbiased,  if  the  bias  approaches 
zero  when  the  sample  size  tends  to  infinity.  The  concept  of  asymptotic  efficiency 
involves  some  mathematical  knowledge  which  is  beyond  the  intended  scope  of 
this  book.  Loosely  speaking,  an  asymptotic  efficient  estimator  is  an  estimator 
which  achieves  the  lowest  possible  (asymptotic)  variance  under  given  distribu¬ 
tional  assumptions.  The  estimators  introduced  in  Sect.  9.3.1,  which  are  based  on  the 
maximum  likelihood  principle,  have  these  properties  (under  certain  mathematically 
defined  regularity  conditions). 


Next,  we  illustrate  the  properties  of  consistency  and  sufficiency  of  an  estimator. 


9.2.2  Consistency  of  Estimators 

For  a  good  estimator,  as  the  sample  size  increases,  the  values  of  the  estimator  should 
get  closer  to  the  parameter  being  estimated.  This  property  of  estimators  is  referred 
to  as  consistency. 

Definition  9.2.6  Let  T\,  T2,  . . . ,  Tn,  be  a  sequence  of  estimators  for  the  parameter 
0  where  Tn  =  Tn(X\,  X2,  . . . ,  Xn)  is  a  function  of  X\,  X2 ,  . . . ,  Xn.  The  sequence 
{ Tn }  is  a  consistent  sequence  of  estimators  for  0  if  for  every  e  >  0, 

lim  P[\Tn  -0\  <e]  =  l 

n^oo 

or  equivalently 

lim  P  \\T„  -  0\  >  e]  =  0. 


190 


9  Inference 


This  definition  says  that  as  the  sample  size  n  increases,  the  probability  that  Tn  is 
getting  closer  to  0  is  approaching  1.  This  means  that  the  estimator  Tn  is  getting 
closer  to  the  parameter  0  as  n  grows  larger.  Note  that  there  is  no  information  on  how 
fast  Tn  is  converging  to  6  in  the  sense  of  convergence  defined  above. 


Example  9.2.5  Let  X\ ,  . . . ,  Xn  be  identically  and  independently  distributed  vari¬ 
ables  with  expectation  p  and  variance  a2.  Then  for  Xn  =  ^”=1  X;,  we  have 

E(XW)  =  p  and  Var(Xw)  =  a2 In.  For  any  e  >  0,  we  can  write  the  following: 


(i\>e]  =  P 


ca 


where  e  =  ca/ *Jn.  Using  Tschebyschev’s  inequality  (Theorem  7.4.1,  p.  139),  we 
get  —  a2 /ne2,  and  therefore 


P 


and 


lim  P 

n^o o 


<  lim 


a“ 


n^o o  nc 


provided  a2  is  finite.  Hence  Xn,n  =  1,2,...,  converges  to  p  and  therefore  Xn  is  a 
consistent  estimator  of  p. 


Remark  9.2.3  We  call  this  type  of  consistency  weak  consistency.  Another  definition 
is  MSE  consistency,  which  says  that  an  estimator  is  MSE  consistent  if  MSE  — >  0 
as  n  — >►  oo.  If  the  estimator  is  unbiased,  it  is  sufficient  that  Var  — >  0  as  n  — >  oo.  If 
Tn(X)  is  MSE  consistent,  it  is  also  weakly  consistent.  Therefore,  it  follows  that  an 
unbiased  estimator  with  its  variance  approaching  zero  as  the  sample  size  approaches 
infinity  is  both  MSE  consistent  and  weakly  consistent. 

In  Example  9.2.5,  the  variance  of  Tn(X)  =  Xn  is  a2 /n  which  goes  to  zero  as  n 
goes  to  oo  and  therefore  Xn  is  both  weakly  consistent  and  MSE  consistent. 


9.2.3  Sufficiency  of  Estimators 

Sufficiency  is  another  criterion  to  judge  the  quality  of  an  estimator.  Before  delving 
deeper  into  the  subject  matter,  we  first  try  to  understand  some  basic  concepts. 

Consider  two  independent  random  variables  X  and  Y,  each  following  a  N(p,  1) 
distribution.  We  conclude  that  both  X  and  Y  contain  information  about  p.  Consider 
two  estimators  of  p  as  p\  =  X  +  Y  and  fi>2  =  X  —  Y.  Suppose  we  want  to  know 
whether  to  use  p\  or  pi  to  estimate  p.  We  notice  that  E(p\)  =  E(X)  +  E(y)  =  p  + 
p  =  2 p,  E(p2)  =  E(X)  —  E(T)  =  p  —  p  =  0,  Var(pq)  =  Var(X)  +  Var(T)  =  1  + 
1=2  and  Var (/i2)  =  Var(X)  +  Var(T)  =  1  +  1  =  2.  Using  the  additivity  property 
of  the  normal  distribution,  which  was  introduced  in  Remark  8.2.2,  we  can  say  that 
pi  ~  N (2 p,  2)  and  +2  ~  7V(0,  2).  So  p\  contains  information  about  p,  whereas  p2 
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does  not  contain  any  information  about  p.  In  other  words,  p2  loses  the  information 
about  p.  We  call  this  property  “loss  of  information”. 

If  we  want  to  make  conclusions  about  p  using  both  X  and  Y,  we  need  to  acknowl¬ 
edge  that  the  dimension  of  them  is  2.  On  the  other  hand,  if  we  use  p\  or  equivalently 
p i/2  ~  Nip,  \),  then  we  need  to  concentrate  only  on  one  variable  and  we  say  that 
it  has  dimension  1.  It  follows  that  p\  and  p i/2  provide  the  same  information  about 
p  as  provided  by  the  entire  sample  on  both  X  and  Y.  So  we  can  say  that  either  p\ 
or  /ii/2  is  sufficient  to  provide  the  same  information  about  p  that  can  be  obtained 
on  the  basis  of  the  entire  sample.  This  is  the  idea  behind  the  concept  of  sufficiency 
and  it  results  in  the  reduction  of  dimension.  In  general,  we  can  say  that  if  all  the 
information  about  p  contained  in  the  sample  of  size  n  can  be  obtained,  for  example, 
through  the  sample  mean  then  it  is  sufficient  to  use  this  one-dimensional  summary 
statistic  to  make  inference  about  p. 

Definition  9.2.7  Let  X \ ,  X2, . . . ,  Xn  be  a  random  sample  from  a  probability  density 
function  (or  probability  mass  function)  fix,  6).  A  statistic  T  is  said  to  be  sufficient 
for  6  if  the  conditional  distribution  of  X\,  X2,  . . . ,  Xn  given  T  —  t  is  independent 
of  6. 

The  Neyman-Fisher  Factorization  Theorem  provides  a  practical  way  to  find  suf¬ 
ficient  statistics. 


Theorem  9.2.2  (Neyman-Fisher  Factorization  Theorem  (NFFT))  Let  X\,  X2,  . . . , 
Xn  be  a  random  sample  from  a  probability  density  function  ( or  probability  mass 
function )  fix,  0).  A  statistic  T  =  T(x\,  X2,  •  •  • ,  xn )  is  said  to  be  sufficient  for  0  if 
and  only  if  the  joint  density  of  X\,  X2,  . . . ,  Xn  can  be  factorized  as 

f(x  i,x2,  0)  =  git,  0)  •  h(x\,  x2, . . . ,  xn) 

where  h(x\ ,  X2, . . . ,  xn)  is  nonnegative  and  does  not  involve  0;  and  g(t,  0)  is  a  non¬ 
negative  function  of  6  which  depends  on  x\,  X2,  . . . ,  xn  only  through  t,  which  is  a 
particular  value  ofT. 

This  theorem  holds  for  discrete  random  variables  too.  Any  one-to-one  function  of 
a  sufficient  statistic  is  also  sufficient.  A  function  f  is  called  one-to-one  if  whenever 
f(a)  =  f(b)  then  a  =  b. 

Example  9.2.6  Let  X\,  X2 , . . . ,  Xn  be  a  random  sample  from  Nip,  1)  where  p  is 
unknown.  We  attempt  to  find  a  sufficient  statistic  for  p.  Consider  the  following 
function  as  the  joint  distribution  of  x\ ,  X2,  . . . ,  xn  (whose  interpretation  will  become 
clearer  in  the  next  section): 

1  v 

vd  exp 

= (X) exp 


( 


192 


9  Inference 


Here 


g(t,  n)  = 


<i) 


n 


exp 


h(x  i,X2, .  ..,xn)  =  exp  (  -- 


(-£4 


n 

t  —  t  (x  i ,  X 2 ,  .  .  .  ,  Xfi )  —  X( . 

i=\ 


Using  the  Neyman-Fisher  Factorization  Theorem,  we  conclude  that  T  =  T(X  i, 
X2,  . . . ,  Wi)  =  Z"=i  Xi  is  a  sufficient  statistic  for  p.  Also,  T  =  T  (X 1, 

X2, . . . ,  Xn)  =  X  is  sufficient  for  (i  as  it  is  a  one-to-one  statistic  of  X?=i  Xi.  On 
the  other  hand,  T  =  X2  is  not  sufficient  for  p  as  it  is  not  a  one-to-one  function  of 
X! =1  Xi .  The  important  point  here  is  that  X  is  a  function  of  the  sufficient  statistic  and 
hence  a  good  estimator  for  (i.  It  is  thus  summarizing  the  sample  information  about 
the  parameter  of  interest  in  a  complete  yet  parsimonious  way.  Another,  multivariate, 
example  of  sufficiency  is  given  in  Appendix  C.4. 


9.3  Point  Estimation 

In  the  previous  section,  we  introduced  and  discussed  various  properties  of  estimators. 
In  this  section,  we  want  to  show  how  one  can  find  estimators  with  good  properties.  In 
the  general  case,  properties  such  as  unbiasedness  and  efficiency  cannot  be  guaranteed 
for  a  finite  sample.  But  often,  the  properties  can  be  shown  to  hold  asymptotically. 


9.3.1  Maximum  Likelihood  Estimation 

We  have  used  several  estimators  throughout  the  book  without  stating  explicitly  that 
they  are  estimators.  For  example,  we  used  the  sample  mean  (X)  to  estimate  in 
a  cr2)  distribution;  we  also  used  the  sample  proportion  (relative  frequency)  to 
estimate  p  in  a  B(1 ,  p)  distribution,  etc.  The  obvious  question  is  how  to  obtain  a  good 
statistic  to  estimate  an  unknown  parameter,  for  example  how  to  determine  that  the 
sample  mean  can  be  used  to  estimate  fi.  We  need  a  general  framework  for  parameter 
estimation.  The  method  of  maximum  likelihood  provides  such  an  approach.  For  the 
purpose  of  illustration,  we  introduce  the  method  of  maximum  likelihood  estimation 
with  an  example  using  the  Bernoulli  distribution. 

Example  9.3.1  Consider  an  i.i.d.  random  sample  X  =  (X\,  X2 ,  . . . ,  Xn)  from  a 
Bernoulli  population  with  p  =  P(Xi  =  1)  and  (1  —  p)  =  P{Xi  =  0).  The  joint 
probability  mass  function  for  a  given  set  of  realizations  x\ ,  X2 ,  . . . ,  xn  (i.e.  the  data)  is 
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P(Xi  =  XI,  X2  =  x2, . . . ,  xn  =  x„\ p)  =  P(X  1  =x\\p) . P(Xn  =  xn\p) 

n 

=  Y[pXi(\-p)l~Xi.  (9.7) 

i= 1 

This  is  a  function  of  (jq,  X2,  . . . ,  xn)  given  the  parameter  p.  The  product  results 
from  the  fact  that  the  draws  are  independent  and  the  fact  that  pXi  (1  —  p)l~Xi  =  p  if 
Xi  =  1  and  pXi  (1  —  p)l~Xi  =  1  —  p  if  Xi  =  0.  That  is,  the  term  pXi  (1  —  p)l~Xi  cov¬ 
ers  results  from  both  possible  outcomes.  Now,  consider  a  random  sample  where  the 
values  jc  =  (*1,  *2, . . . ,  xn)  are  known,  for  example  jc  =  (0,  1,  0,  0, ... ,  1).  Then, 
(9.7)  can  be  seen  as  a  function  of  p  because  (x\ ,  X2, . . . ,  xn)  is  known.  In  this  case, 
after  obtaining  a  sample  of  data,  the  function  is  called  the  likelihood  function  and 
can  be  written  as 

n 

L(x\,x2,  ...  ,xn\p)  =  Y\pxHl  -  p)l~Xi  ■  (9.8) 

/  —  I 

The  joint  density  function  of  X\,  X2,  . . . ,  Xn  is  called  the  likelihood  function. 
For  better  understanding,  consider  a  sample  of  size  5  with  x  =  (x\  =  1,  X2  =  1,  X3  = 
0,  X4  =  1,  V5  =  0).  The  likelihood  (function)  is 

L(l,  1,  0, 1,  0| p)  =  p  ■  p  ■  (1  -  p)  ■  p  ■  (1  -  p)  =  p3(  1  -  p)2  .  (9.9) 

The  maximum  likelihood  estimation  principle  now  says  that  the  estimator  p  of  p 
is  the  value  of  p  which  maximizes  the  likelihood  (9.8)  or  (9.9).  In  other  words, 
the  maximum  likelihood  estimate  is  the  value  which  maximizes  the  probability  of 
observing  the  realized  sample  from  the  likelihood  function.  In  general,  i.e.  for  any 
sample,  we  have  to  maximize  the  likelihood  function  (9.9)  with  respect  to  p.  We  use 
the  well-known  principle  of  maxima-minima  to  maximize  the  likelihood  function  in 
this  case.  In  principle,  any  other  optimization  procedure  can  also  be  used,  for  example 
numerical  algorithms  such  as  the  Newton-Raphson  algorithm.  If  the  likelihood  is 
differentiable,  the  first-order  condition  for  the  maximum  is  that  the  first  derivative 
with  respect  to  p  is  zero.  For  maximization,  we  can  transform  the  likelihood  by  a 
strictly  monotone  increasing  function.  This  guarantees  that  the  potential  maximum 
is  taken  at  the  same  point  as  in  the  original  likelihood.  A  good  and  highly  common 
choice  is  the  natural  logarithm  since  it  transforms  products  in  sums  and  sums  are 
easy  to  differentiate  by  differentiating  each  term  in  the  sum.  The  log-likelihood  in 
our  example  is  therefore 

/( 1,  1.  0, 1,  0| p)  =  In  L{  1, 1.0,  1, 0,  I p)  =  In  {p3(  1  -  p)2} 

=  3  \n(p)  +  2  ln(l  —  p) 

where  In  denotes  the  natural  logarithm  function  and  we  use  the  rules 

In  (a  •  b)  =  In  (a)  +  In  (b)  ,  a  >  0,  b  >  0 
In  ^  =  In  (a)  —  In  (b)  ,  a  >  0,  b  >  0 

In  (a b  ^  =  b  In  (a)  ,a>  0  . 


(9.10) 

(9.11) 
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Likelihood  as  a  function  dependent  on  p  log-Likelihood  as  a  function  dependent  on  p 


Fig.  9.2  Illustration  of  the  likelihood  and  log-likelihood  function  of  a  binomial  distribution 


Taking  the  first  derivative  of  (9.10)  with  respect  to  p  results  in 

<9/(1, 1,0,  l,0|p)  _  3  2  ,niON 

op  p  1  —  p 

Setting  (9.12)  to  zero  and  solving  for  p  leads  to 

3  2 

- =  0 

p  i  -  p 

3  _  2 

P  1  ~  P 

3(1  —  p)  =  2p 
5p  =  3 

Pml  =  -  =  -(1  +  1+  0+1+0)=x. 

The  value  of  the  second-order  partial  derivative  of  (9.9)  with  respect  to  p  at  p  =  pml 
is  negative  which  ensures  that  pml  maximizes  the  likelihood  function.  It  follows  from 
this  example  that  the  maximum  likelihood  estimate  for  p  leads  to  the  well-known 
arithmetic  mean.  Figure  9.2  shows  the  likelihood  function  and  the  log-likelihood 
function  as  functions  of  p ,  where  p  e  [0,  1].  The  figures  show  that  the  likelihood 
function  and  the  log-likelihood  function  have  the  same  maxima  at  p  =  3/5  =  0.6. 

Maximum  likelihood  estimators  have  some  important  properties:  they  are  usually 
consistent,  asymptotically  unbiased,  asymptotically  normally  distributed,  asymptot¬ 
ically  efficient,  and  sufficient.  Even  if  they  are  not,  a  function  of  a  sufficient  statistic 
can  always  be  found  which  has  such  properties.  This  is  the  reason  why  maximum 
likelihood  estimation  is  popular.  By  “asymptotically”  we  mean  that  the  properties 
hold  as  n  tends  to  infinity,  i.e.  as  the  sample  size  increases.  There  might  be  other 
good  estimators  in  a  particular  context,  for  example  estimators  that  are  efficient  and 
not  only  asymptotically  efficient;  however,  in  general,  the  ML  principle  is  a  great 
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choice  in  many  circumstances.  We  are  going  to  use  it  in  the  following  sections  and 
chapters,  for  instance  for  general  point  estimation  and  in  the  linear  regression  model 
(Chap.  11). 

Remark  9.3.1  More  examples  of  maximum  likelihood  estimators  are  given  in 
Exercises  9. 1-9.3. 


9.3.2  Method  of  Moments 


The  method  of  moments  is  another  well-known  method  to  derive  the  estimators  for 
population  parameters.  Below,  we  outline  this  principle  briefly  by  way  of  example. 

The  idea  is  that  the  population  parameters  of  interest  can  be  related  to  the  moments 
(e.g.  expectation,  variance)  of  the  distribution  of  the  considered  random  variables. 

A  simple  case  is  the  estimator  for  the  expected  value  E(X)  =  /iofa  population 
using  an  i.i.d.  random  sample  X  =  (X\,  . . . ,  Xn).  In  this  case,  jl  =  X  is  the  natural 
moment  estimator  of  fi.  Further,  since  E(X2)  =  a2  +  /i2,  an  estimator  of  a2  +  fi2  is 
X/=i  Xf-  Using  X2  as  an  estimator  for  /i2,  this  results  in  the  biased,  but  asymp¬ 
totically  unbiased  estimate 


1 

n 


YSXi  -  ^)2  • 

i  =  l 


An  extension  of  this  method  is  the  generalized  method  of  moments  (GMM).  GMM 
estimators  have  interesting  properties:  under  relatively  week  conditions  (not  further 
discussed  here),  they  are  consistent  and  asymptotically  normal,  as  well  as  efficient 
in  the  class  of  those  estimators  that  do  not  use  any  additional  information  besides 
the  information  included  in  the  moment  conditions.  Usually,  they  require  a  two-step 
estimating  approach  or  an  iterative  estimating  procedure. 

The  least  squares  estimator  for  a  linear  regression  model  with  i.i.d.  random 
errors,  discussed  in  detail  in  Chap.  11,  can  be  seen  as  a  special  case  of  a  GMM 
estimator. 


9.4  Interval  Estimation 
9.4.1  Introduction 

Let  us  first  consider  an  example  to  understand  what  we  mean  by  interval  estimation. 
Consider  a  situation  in  which  a  lady  wants  to  know  the  time  taken  to  travel  from  her 
home  to  the  train  station.  Suppose  she  makes  20  trips  and  notes  down  the  time  taken. 
To  get  an  estimate  of  the  expected  time,  one  can  use  the  arithmetic  mean.  Let  us  say 
x  =  25  min.  This  is  the  point  estimate  for  the  expected  travelling  time.  It  may  not  be 
appropriate  to  say  that  she  will  always  take  exactly  25  min  to  reach  the  train  station. 
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Rather  the  time  may  vary  by  a  few  minutes  each  time.  To  take  this  into  account,  the 
time  can  be  estimated  in  the  form  of  an  interval:  it  may  then  be  found  that  the  time 
varies  mostly  between  20  and  30  min.  Such  a  statement  is  more  informative.  Both 
expectation  and  variation  of  the  data  are  taken  into  account.  The  interval  (20,  30  min) 
provides  a  range  in  which  most  of  the  values  are  expected  to  lie.  We  call  this  concept 
interval  estimation. 

A  point  estimate  on  its  own  does  not  take  into  account  the  precision  of  the  estimate. 
The  deviation  between  the  point  estimate  and  the  true  parameter  (e.g.  \x  —  p\)  can  be 
substantial,  especially  when  the  sample  size  is  small.  To  incorporate  the  information 
about  the  precision  of  an  estimate  in  the  estimated  value,  a  confidence  interval  can 
be  constructed.  It  is  a  random  interval  with  lower  and  upper  bounds,  //  (X)  and 
/M(X),  such  that  the  unknown  parameter  0  is  covered  by  a  prespecified  probability 
of  at  least  l  —  a: 

Pe(h(X)  <  6  <  IU(X))  >  l -a.  (9.13) 

The  probability  1  —  a  is  called  the  confidence  level  or  confidence  coefficient,  //  (X) 
is  called  the  lower  confidence  bound  or  lower  confidence  limit  and  Iu  (X)  is  called 
the  upper  confidence  bound  or  upper  confidence  limit.  It  is  important  to  note  that 
the  bounds  are  random  and  the  parameter  is  a  fixed  value.  This  is  the  reason  why 
we  say  that  the  true  parameter  is  covered  by  the  interval  with  probability  1  —  a  and 
not  that  the  probability  that  the  interval  contains  the  parameter  is  1  —  a.  Please  note 
that  some  software  packages  use  the  term  “error  bar”  when  referring  to  confidence 
intervals. 

Frequency  interpretation  of  the  confidence  interval:  Suppose  N  independent 
samples  X^\  j  =  1,  2, . . . ,  N,  of  size  n  are  sampled  from  the  same  population 
and  N  confidence  intervals  of  the  form  [//(X(j)),  /M(X^)]  are  calculated.  If  N  is 
large  enough,  then  on  an  average  N(  1  —  a)  of  the  intervals  (9.13)  cover  the  true 
parameter. 

Example  9.4.1  Let  a  random  variable  follow  a  normal  distribution  with  p  =  10  and 
a2  =  1.  Suppose  we  draw  a  sample  of  n  =  10  observations  repeatedly.  The  sample 
will  differ  in  each  draw,  and  hence,  the  mean  and  the  confidence  interval  will  also 
differ.  The  data  sets  are  realizations  from  random  variables.  Have  a  look  at  Fig.  9.3 
which  illustrates  the  mean  and  the  95  %  confidence  intervals  for  6  random  samples. 
They  vary  with  respect  to  the  mean  and  the  confidence  interval  width.  Most  of  the 
means  are  close  top  =  10,  but  not  all.  Similarly,  most  confidence  intervals,  but  not  all, 
include  p.  This  is  the  idea  of  the  frequency  interpretation  of  the  confidence  interval: 
different  samples  will  yield  different  point  and  interval  estimates.  Most  of  the  times 
the  interval  will  cover  p ,  but  not  always.  The  coverage  probability  is  specified  by 
1  —  a,  and  the  frequency  interpretation  means  that  we  expect  that  (approximately) 
(1  —  a)  •  100%  of  the  intervals  to  cover  the  true  parameter  p.  In  that  sense,  the 
location  of  the  interval  will  give  us  some  idea  about  where  the  true  but  unknown 
population  parameter  p  lies,  while  the  length  of  the  interval  reflects  our  uncertainty 
about  p:  the  wider  the  interval  is,  the  higher  is  our  uncertainty  about  the  location 
of  p. 
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Fig.  9.3  Frequency 
interpretation  of  confidence 
intervals 
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We  now  introduce  the  following  confidence  intervals: 

•  Confidence  interval  for  the  mean  p  of  a  normal  distribution. 

•  Confidence  interval  for  the  probability  p  of  a  binomial  random  variable. 

•  Confidence  interval  for  the  odds  ratio. 


9.4.2  Confidence  Interval  for  the  Mean  of  a  Normal  Distribution 


Confidence  Interval  for  pi  When  cr2  =  cr^  is  Known. 

Let  X\,  X2,  . . . ,  Xn  be  an  i.i.d.  sample  from  a  N(p,  ctq)  distribution  where  <Tq  is 
assumed  to  be  known.  We  use  the  point  estimate  X  =  ^  X/=i  %i  t0  estimate  p 
and  construct  a  confidence  interval  around  the  mean  p.  Using  the  Central  Limit 
Theorem  (Appendix  C.3,  p.  426),  it  follows  that  X  follows  a  N(p,  cF^/n)  distribution. 
Therefore  +Jn{X  —  p)/a o  ~  N(0,  1),  and  it  follows  that 


^  ( 

where  z\~a/2  denotes  the  (1  —  a/2)  quantile  of  the  standard  normal  distribution 
N( 0,  1).  We  solve  this  inequality  for  the  unknown  p  and  get  the  desired  confidence 
interval  as  follows: 


y/n(X  p) 


cr  o 


<  Z\. 


) 


=  1  —  a 


(9.14) 


or 


<  z\ 


Zl-a/2 


< 


P  <  X  +  Zl-a/2 


=  1  —  a  . 
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The  confidence  interval  for  p  is  thus  obtained  as 


[//(X),4(X)] 


Zl—a/2 


,  x  +  Zl—a/2 


(9.15) 


This  is  known  as  (1  —  a)  %  confidence  interval  for  p  or  the  confidence  interval  for 
p  with  confidence  coefficient  a. 

We  can  use  the  R  function  qnorm  or  Table C.l  to  obtain  Z\-* ,  see  also  Sects.  8.4, 
A. 3,  and  C.7.  For  example,  for  a  =  0.05  and  a  =  0.01  we  get  zi _  «  =  zo.975  =  1.96 
and  Zj_o  =  zo.995  =  2.576  using  qnorm (0 . 975)  and  qnorm (0 . 995) .  This  gives  us 
the  quantiles  we  need  to  determine  a  95  %  and  99  %  confidence  interval,  respectively. 


Example  9.4.2  We  consider  again  Example  9.2.3  where  we  evaluated  the  weight  of 
10-year-old  children.  Assume  that  the  variance  is  known  to  be  36;  then  the  upper  and 
lower  limits  of  a  95  %  confidence  interval  for  the  expected  weight  p  can  be  calculated 
as  follows: 


<7o  \/36 

h(X)  =  X-  zi-a/2^  =  41.97  -  1.96-= 

V20 

<70  V36 

IU(X)  =  X  +  zi-a/2~i=  =  41.97  +  1.96 


39.34, 


44.59. 


'n  V20 

We  get  the  confidence  interval  [IU(X),  I0(X)]  =  [39.34,  44.59].  With  95  %  confi¬ 
dence,  the  true  parameter  p  is  covered  by  the  interval  [39.34,  44.59]. 


Confidence  Interval  for  p  When  a2  is  Unknown. 

Let  X\,  X2 ,  . . . ,  Xn  be  an  i.i.d.  sample  from  N(p,  a2)  where  cr2  is  assumed  to  be 
unknown  and  is  being  estimated  by  the  sample  variance  S\ .  We  know  from  Sect.  8.3.1 
that 

(n  -  1)5|  2 

- ^2 -  ~  *»- 1  • 

It  can  be  shown  that  X  and  S2X  are  stochastically  independent.  Thus,  we  know  that 

*Jn(X  -  /i) 

4 — 1 
Sx 

follows  a  t -distribution  with  n  —  1  degrees  of  freedom.  We  can  use  this  result  to 
determine  the  confidence  interval  for  p  as 


or 


-h-%,n-l  < 


X  -  t 


<  p  <  X  +  t 
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The  confidence  interval  for  p  is  thus  obtained  as 


[//(X),  4(X)] 


X  62  — 1;1— ck/2 


X  +  tn-  1;  1  — ot/2 


(9.16) 


which  is  the  100(1  —  a)%  confidence  interval  for  p  or  the  confidence  interval  for  p 
with  confidence  coefficient  a. 

The  interval  (9.16)  is,  in  general,  wider  than  the  interval  (9.15)  for  identical  a  and 
identical  sample  size  n ,  since  the  unknown  parameter  a2  is  estimated  by  S\  which 
induces  additional  uncertainty.  The  quantiles  for  the  t -distribution  can  be  obtained 
using  the  R  command  qt  or  Table  C.2. 


Example  9.4.3  Consider  Example  9.4.2  where  we  evaluated  the  weight  of  10-year- 
old  children.  We  have  already  calculated  the  point  estimate  of  p  as  x  =41.97.  With 
^19;0.975  =  2.093,  obtained  via  qt (0.975 ,19)  or  TableC.2,  the  upper  and  lower 
limits  of  a  95  %  confidence  interval  for  p  are  obtained  as 


sx  6.07 

IuQ. 0  =X-  t!9;0.975  ~  =  41.97  -  2.093  ■-=«  39.12  , 

y/n  V20 

sx  6.07 

Io(X)  =  x  +  ti9;0.975  •  -4=  =  41.97  +  2.093  •  -^=  «  44.81  . 


V20 


Therefore,  the  confidence  interval  is  [//(X),  IU(X)]  =  [39.13,  44.81].  In  R,  we  can 
use  the  conf  .  int  value  of  the  t .  test  command  to  get  a  confidence  interval  for  the 
mean  (see  also  Example  10.3.3  for  more  details  on  t .  test).  The  default  is  a  95  % 
confidence  interval,  but  it  can  be  changed  easily  if  desired: 


x  <-  c (40 . 2 ,  32.8,  38.2,  43.5,  ...,  49.9,  34.2) 
t . test (x , conf . level  =  0 . 95) $conf . int 
[1]  39.12384  44.80616 


There  is  no  unique  best  way  to  draw  the  calculated  confidence  intervals  in  R. 
Among  many  other  options,  one  can  simply  work  with  the  plot  functionality  or 
use  geom_errorbar  in  conjunction  with  a  ggplot  object  created  with  the  library 
ggplot2,  or  use  the  plotCI  command  in  the  library  plotrix. 


9.4.3  Confidence  Interval  for  a  Binomial  Probability 


Let  X\,  X2,  . . . ,  Xn  be  an  i.i.d.  sample  from  a  Bernoulli  distribution  B(  1,  p).  Then 
Y  =  X/=i  Xi  has  a  binomial  distribution  B(n,  p). 

We  have  already  introduced  p  as  an  estimator  for  p: 


1 

-Y. 

n 
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From  (8.8),  we  know  that  Var(y)  =  np{  1  —  p).  Applying  rule  (7.33),  the  variance 
of  the  estimator  p  is 


Var(/3)  = 


p(  i  -  p) 
n 


and  it  can  be  estimated  by 


P(  1  ~  P) 
n 


Nowadays,  the  exact  confidence  intervals  of  the  binomial  distribution  function  can 
be  easily  calculated  using  computer  implementations.  Nevertheless,  (i)  for  a  suffi¬ 
ciently  large  sample  size  n ,  (ii)  if  p  is  not  extremely  low  or  high,  and  (iii)  if  the 
condition  np(l  —  p)  >9  is  fulfilled,  we  can  use  an  approximation  based  on  the  nor¬ 
mal  distribution  to  calculate  confidence  intervals.  To  be  more  specific,  one  can  show 


that 


P~  P 

V p(l  -  p)/n 


approx. 


N(0,  1). 


(9.17) 


This  gives  us 


P  Zl—a/2- 


pa-p) 


n 


<  p  <  p  +Z\-a/2- 


p( i  -  P) 


n 


l -a,  (9.18) 


and  we  get  a  confidence  interval  for  p  as 


P  ~  Zl-a/2 


Pd-P) 


n 


P  +  Zl-a/2- 


P(  1  -  P) 


n 


(9.19) 


Example  9.4.4  We  look  again  at  Example  9.2.4  where  we  evaluated  the  proportion 
of  members  who  had  to  pay  a  penalty.  Out  of  all  borrowers,  39  %  brought  back  their 
books  late  and  thus  had  to  pay  a  fee.  A  95  %  confidence  interval  for  the  probability 
p  of  bringing  back  a  book  late  can  be  constructed  using  the  normal  approximation, 
sincenp(l  —  p)  =  100  •  0.39  •  0.61  =  23.79  >  9.  Withzi-a/2  =  zo.975  —  1.96 and 
p  =  0.39,  we  get  the  95  %  confidence  interval  as 


0.39  —  1.96, 


0.39-0.61 


100 


,  0.39+  1.96, 


0.39-0.61 


100 


[0.294,0.486]. 


In  R,  an  exact  confidence  interval  can  be  found  using  the  function  binom .test: 

binom . test (x=39 ,n=100) $conf . int 
[1]  0.2940104  0.4926855 

One  can  see  that  the  exact  and  approximate  confidence  limits  differ  slightly  due 
to  the  normal  approximation  which  approximates  the  exact  binomial  probabilities. 


9.4  Interval  Estimation 


201 


9.4.4  Confidence  Interval  for  the  Odds  Ratio 

In  Chap.  4,  we  introduced  the  odds  ratio  to  determine  the  strength  of  association 
between  two  binary  variables.  One  may  be  interested  in  the  dispersion  of  the  odds 
ratio  and  hence  calculate  a  confidence  interval  for  it.  Recall  the  notation  for  2  x  2 
contingency  tables: 


Y 

yi  y2 

Total  (row) 

X  Xl 

X2 

a  b 

c  d 

a  +  b 
c  +  d 

Total  (column) 

a  +  c  b  +  d 

n 

In  the  spirit  of  the  preceding  sections,  we  can  interpret  the  entries  in  this  con¬ 
tingency  table  as  population  parameters.  For  example,  a  describes  the  absolute  fre¬ 
quency  of  observations  in  the  population  for  which  Y  =  y\  and  X  =  x\.  If  we  have 
a  sample  then  we  can  estimate  a  by  the  number  of  observed  observations  n\\  for 
which  Y  =  y\  and  X  =  x\ .  We  can  thus  view  n\\  to  be  an  estimator  for  a,  n\2  to  be 
an  estimator  for  b,  ri2\  to  be  an  estimator  for  c,  and  ri22  to  be  an  estimator  for  d.  It 
follows  that 


OR  = 


ftnft22 


(9.20) 


fti2ft2i 

serves  as  the  point  estimate  for  the  population  odds  ratio  OR  =  ad /be.  To  construct 
a  confidence  interval  for  the  odds  ratio,  we  need  to  work  on  a  log-scale.  The  log  odds 
ratio, 


Oq  =  In  OR  =  In  a  —  In  b  —  In  c  +  In  d,  (9.21) 


takes  the  natural  logarithm  of  the  odds  ratio.  It  is  evident  that  it  can  be  estimated 
using  the  observed  absolute  frequencies  of  the  joint  frequency  distribution  of  X  and 
Y: 

'X  - — -  ^11^22 

0O  =  In  OR  =  In  J±J±  .  (9.22) 

ft  12^21 


It  can  be  shown  that  60  follows  approximately  a  normal  distribution  with  expectation 
Oo  and  standard  deviation 


(9.23) 


Following  the  reasoning  explained  in  the  earlier  section  on  confidence  intervals  for 
binomial  probabilities,  we  can  calculate  the  100(1  —  a)  %  confidence  interval  for  Oo 
under  a  normal  approximation  as  follows: 


00  -  Zi-ad^,  Oo  + 


2  Oo 


=  UuJo ] 


(9.24) 
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Since  we  are  interested  in  the  confidence  interval  of  the  odds  ratio,  and  not  the  log 
odds  ratio,  we  need  to  transform  back  the  lower  and  upper  bound  of  the  confidence 
interval  as 

[exp(/„),exp(/0)]  .  (9.25) 

Example  9.4.5  Recall  Example  4.2.5  from  Chap.  4  where  we  were  interested  in  the 
association  of  smoking  with  a  particular  disease.  The  data  is  summarized  in  the 
following  2x2  contingency  table: 


Smoking 
Yes  No 

Total  (row) 

Yes 

Disease 

No 

34  66 

100 

22  118 

140 

Total  (column) 

56  184 

240 

The  odds  ratio  was  estimated  to  be  2.76,  and  we  therefore  concluded  that  the  chances 
of  having  the  particular  disease  is  2.76  times  higher  for  smokers  compared  with  non- 

/V 

smokers.  To  calculate  a  95  %  confidence  intervals,  we  need  Oo  =  ln(2.76),  z\ 

1.96  and 


Ct 


^0  _ 


1 

1 

1 

■  + 

+ 

n  it 

nil 

n\2 

1 

1 

1 

T 

+ 

T 

34 

118 

66 

n  21/ 

±y 

22/ 


0.314. 


The  confidence  interval  for  the  log  odds  ratio  is 

[ln(2.76)  -  1.96  •  0.314,  ln(2.76)  +  1.96  •  0.314]  «  [0.40,  1.63] 


Exponentiation  of  the  confidence  interval  bounds  yields  the  95  %  confidence  interval 
for  the  odds  ratio  as 


[1.49,5.11]. 

There  are  many  ways  to  obtain  the  same  results  in  R.  One  option  is  to  use  the 
oddsratio  function  of  the  library  epitools.  Note  that  we  need  to  specify  “wald” 
under  the  methods  option  to  get  confidence  intervals  which  use  the  normal  approxi¬ 
mation  as  we  did  in  this  case. 


library (epitools) 

smd  <-  matrix (c (34, 22, 66, 118) ,ncol=2 ,nrow=2)  #data 
oddsratio (smd,method= ' wald ' ) 
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9.5  Sample  Size  Determinations 

Confidence  intervals  help  us  estimating  the  precision  of  point  estimates.  What  if 
we  are  required  to  adhere  to  a  prespecified  precision  level?  We  know  that  the  vari¬ 
ance  decreases  as  the  sample  size  increases.  In  turn,  confidence  intervals  become 
narrower.  On  the  other  hand,  increasing  the  sample  size  has  its  own  consequences. 
For  example,  the  cost  and  time  involved  in  setting  up  experiments,  or  conducting 
a  survey,  increases.  In  these  situations  it  is  important  to  find  a  balance  between  the 
variability  of  the  estimates  and  the  sample  size.  We  cannot  control  the  variability  in 
the  data  in  most  of  the  situations,  but  it  is  possible  to  control  the  sample  size  and 
therefore  the  precision  of  our  estimates.  For  example,  we  can  control  the  number 
of  people  to  be  interviewed  in  a  survey — given  the  resources  which  are  available. 
We  discuss  how  to  determine  the  number  of  observations  needed  to  get  a  particular 
precision  (length)  of  the  confidence  interval.  We  find  the  answers  to  such  questions 
using  the  formulae  for  confidence  intervals. 


Sample  Size  Calculation  for  fi. 


Let  us  consider  the  situation  where  we  are  interested  in  estimating  the  population 
mean  /i.  The  length  of  the  confidence  interval  (9.15)  for  the  point  estimate  X  is 


2zi-a/2 


(9.26) 


We  would  now  like  to  fix  the  width  of  the  confidence  interval  and  come  up  with 
a  sample  size  which  is  required  to  achieve  this  width.  Let  us  fix  the  length  of  the 
confidence  interval  as 

A=2Zl-a/2^-  (9.27) 

Jn 


Assume  we  have  knowledge  of  <jo.  The  knowledge  about  <jo  can  be  obtained,  for 
example,  through  a  pilot  study  or  past  experience  with  the  experiment.  We  are  inter¬ 
ested  in  obtaining  the  value  of  n  for  which  a  confidence  interval  has  a  fixed  confidence 
width  of  A  or  less.  Rearranging  (9.27)  gives  us 


n  > 


Z\-g/2Cro 

A 


-i  2 


This  means  a  minimum  or  optimum  sample  size  is 


ftopt  — 


Zl— a/20’0 

A 


-i2 


(9.28) 


(9.29) 


The  sample  size  nopt  ensures  that  the  1  —  a  confidence  interval  for  fi  has  at  most 
length  A.  But  note  that  we  have  assumed  that  cro  is  known.  If  we  do  not  know  <jo 
(which  is  more  likely  in  practice),  we  have  to  make  an  assumption  about  it,  e.g.  by 
using  an  estimate  from  a  former  study,  a  pilot  study,  or  other  external  information. 
Practically,  (9.28)  is  used  in  the  case  of  known  and  unknown  ctq. 
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Example  9.5.1  A  call  centre  is  interested  in  determining  the  expected  length  of  a 
telephone  call  as  precisely  as  possible.  The  requirements  are  that  the  95  %  confidence 
interval  for  p  should  have  a  width  of  1  min.  Suppose  that  the  call  centre  has  developed 
a  pilot  study  in  which  <jo  was  estimated  to  be  5  min.  The  sample  size  n  that  is  needed 
to  estimate  the  expected  length  of  the  phone  calls  with  the  desired  precision  is: 


n  > 


2zi-a/2Cro 

2 

"2  x  1.96  x  5' 

A 

1 

384. 


This  means  that  at  least  384  calls  are  required  to  get  the  desired  confidence  interval 
width. 


Sample  Size  Calculation  for  p. 


We  can  follow  the  earlier  reasoning  and  determine  the  optimum  sample  size  for  a 
specific  confidence  interval  width  using  the  confidence  interval  definition  (9.19). 
Since  the  width  of  the  confidence  interval  is 


we  get 


A  =  2z\-ol/2 


n  > 


"2  £l-a/2 

A 


2 


(9.30) 


Example  9.5.2  A  factory  may  be  interested  in  the  probability  of  an  error  in  an  oper¬ 
ating  process.  The  length  of  the  confidence  interval  should  be  ±2  %,  i.e.  A  =  0.04. 
Suppose  it  is  speculated  that  the  error  probability  is  10  %;  we  may  then  use  p  =  0.1 
as  our  prior  judgment  for  the  true  value  of  p.  This  yields 


n  > 


2  £i-a/2 

A 


2 


2  x 


1.96 

004 


~i2 


0.1  •  (1 


0.1)  «  865.  (9.31) 


This  means  we  need  a  sample  size  of  at  least  865  to  obtain  the  desired  width  of  the 
confidence  interval  for  p. 


The  above  examples  for  both  p  and  p  have  shown  us  that  without  external  knowl¬ 
edge  about  the  research  question  of  interest,  it  is  difficult  to  come  up  with  an  appro¬ 
priate  sample  size.  Results  may  vary  considerably  depending  on  what  type  of  infor¬ 
mation  is  assumed  to  be  known.  With  limited  knowledge,  it  can  be  useful  to  report 
results  for  different  widths  of  confidence  intervals  and  hypothesized  values  of  p  or 

(TO- 

Sample  size  calculations  can  be  highly  complex  in  many  practical  situations  and 
may  not  remain  as  simple  as  in  the  examples  considered  here.  For  example,  Chap.  10 
uses  additional  concepts  in  the  context  of  hypothesis  testing,  such  as  the  power, 
which  can  be  taken  into  consideration  when  estimating  sample  sizes.  However, 
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in  this  case,  calculations  and  interpretations  become  more  difficult  and  complex. 
A  detailed  overview  of  sample  size  calculations  can  be  found  in  Chow  et  al.  (2007) 
and  Bock  (1997). 


9.6  Key  Points  and  Further  Issues 


Note: 


/  We  have  introduced  important  point  estimates  for  the  parameters  of  a  nor¬ 
mal  and  a  binomial  distribution: 

v  for  /i,  S 2  for  a2,  x  for  p. 

In  general,  the  choice  of  these  point  estimates  is  not  arbitrary  but  follows 
some  principles  of  statistical  inference  such  as  maximum  likelihood  esti¬ 
mation,  or  least  squares  estimation  (introduced  in  Chap.  11). 

/  The  maximum  likelihood  estimator  is  usually  consistent,  asymptotically 
unbiased,  asymptotically  normally  distributed,  and  asymptotically  effi¬ 
cient. 

/  The  validity  of  all  results  in  this  chapter  depends  on  the  assumption  that 
the  data  is  complete  and  has  no  missing  values.  Incomplete  data  may  yield 
different  conclusions. 

/  A  confidence  interval  is  defined  in  terms  of  upper  and  lower  confidence 
limits  and  covers  the  true  target  parameter  with  probability  l  —  a.  Confi¬ 
dence  intervals  are  often  constructed  as  follows: 

point  estimate  ±  quantile  •  ^/variance  of  point  estimate  . 

" - V - ' 

standard  error 

/  More  detailed  introductions  to  inference  are  presented  in  Casella  and 
Berger  (2002)  and  Young  and  Smith  (2005). 


9.7  Exercises 

Exercise  9.1  Consider  an  i.i.d.  sample  of  size  n  from  a  Po(A)  distributed  random 
variable  X. 
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(a)  Determine  the  maximum  likelihood  estimate  for  A. 

(b)  What  does  the  log-likelihood  function  look  like  for  the  following  realizations: 
xi  =  4,  X2  =  3,  X3  =  8,  X4  =  6,  X5  =  6?  Plot  the  function  using  R.  Hint:  The 
curve  command  can  be  used  to  plot  functions. 

(c)  Use  the  Neyman-Fisher  Factorization  Theorem  to  argue  that  the  maximum  like¬ 
lihood  estimate  obtained  in  (a)  is  a  sufficient  statistic  for  A. 

Exercise  9.2  Consider  an  i.i.d.  sample  of  size  n  from  a  N(/i,  a2)  distributed  random 
variable  X. 

(a)  Determine  the  maximum  likelihood  estimator  for  fi  under  the  assumption  that 
a2  =  1. 

(b)  Now  determine  the  maximum  likelihood  estimator  for  fi  for  an  arbitrary  a2. 

(c)  What  is  the  maximum  likelihood  estimate  for  cr2? 

Exercise  9.3  Let  X\,  X2,  . . . ,  Xn  be  n  i.i.d.  random  variables  which  follow  a  uni¬ 
form  distribution,  U (0,0).  Write  down  the  likelihood  function  and  argue,  without 
differentiating  the  function,  what  the  maximum  likelihood  estimate  of  6  is. 

Exercise  9.4  Let  X \ ,  X2, . . . ,  Xn  be  n  i.i.d.  random  variables  which  follow  an  expo¬ 
nential  distribution.  An  intelligent  statistician  proposes  to  use  the  following  two 
estimators  to  estimate  fi  =  1/A: 

(i)  Tn(X)  =  nXmin  with  Xmin  =  min(Xi,  . . . ,  Xn)  and  Xmin  ~  Exp(nA), 

(ii)  VAX)  =  n~l  X:L|  Xi. 

(a)  Are  both  Tn(X)  and  Vn(X)  (asymptotically)  unbiased  for  fit 

(b)  Calculate  the  mean  squared  error  of  both  estimators.  Which  estimator  is  more 
efficient? 

(c)  Is  Vn  (X)  MSE  consistent,  weakly  consistent,  both,  or  not  consistent  at  all? 

Exercise  9.5  A  national  park  in  Namibia  determines  the  weight  (in  kg)  of  a  sample 
of  common  eland  antelopes: 

450  730  700  600  620  660  850  520  490  670  700  820 
910  770  760  620  550  520  590  490  620  660  940  790 


Calculate 

(a)  the  point  estimate  of  fi  and  a2  and 

(b)  the  confidence  interval  for  fi  (a  =  0.05). 

under  the  assumption  that  the  weight  is  normally  distributed. 

(c)  Use  R  to  reproduce  the  results  from  (b). 
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Exercise  9.6  We  are  interested  in  the  heights  of  the  players  of  the  two  basketball 
teams  “Brose  Baskets  Bamberg”  and  “Bayer  Giants  Leverkusen”  as  well  as  the 
football  team  “SV  Werder  Bremen”.  The  following  summary  statistics  are  given: 


N  Minimum  Maximum  Mean  Std.  dev. 

Bamberg 

Leverkusen 

Bremen 

16  185  211  199.06  7.047 

14  175  210  196.00  9.782 

23  178  195  187.52  5.239 

Calculate  a  95  %  confidence  interval  for  p  for  all  three  teams  and  interpret  the  results. 

Exercise  9.7  A  married  couple  tosses  a  coin  after  each  dinner  to  determine  who  has 
to  wash  the  dishes.  If  the  coin  shows  “head”,  then  the  husband  has  to  wash  the  dishes, 
and  if  the  coin  shows  “tails”,  then  the  wife  has  to  wash  the  dishes.  After  98  dinners, 
the  wife  notes  that  the  coin  has  shown  head  59  times. 

(a)  Estimate  the  probability  that  the  wife  has  to  wash  the  dishes. 

(b)  Calculate  and  interpret  the  95  %  confidence  interval  for  p. 

(c)  How  many  dinners  are  needed  to  estimate  the  true  probability  for  the  coin  show¬ 
ing  “head”  with  a  precision  of  ±0.5  %  under  the  assumption  that  the  coin  is 
fair? 

Exercise  9.8  Suppose  93  out  of  104  pupils  have  passed  the  final  examination  at  a 
certain  school. 

(a)  Calculate  a  95  %  confidence  interval  for  the  probability  of  failing  the  examination 
both  by  manual  calculations  and  by  using  R ,  and  compare  the  results. 

(b)  At  county  level  3.2%  of  pupils  failed  the  examination.  Are  the  school’s  pupils 
worse  than  those  in  the  whole  county? 

Exercise  9.9  To  estimate  the  audience  rate  for  several  TV  stations,  3000  households 
are  asked  to  allow  a  device,  which  records  which  TV  station  is  watched,  to  be 
installed  on  their  TVs.  2500  agreed  to  participate.  Assume  it  is  of  interest  to  estimate 
the  probability  of  someone  switching  on  the  TV  and  watching  the  show  “Germany’s 
next  top  model”. 


(a)  What  is  the  precision  with  which  the  probability  can  be  estimated? 

(b)  What  source  of  bias  could  potentially  influence  the  estimates? 
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Exercise  9.10  An  Olympic  decathlon  athlete  is  interested  in  his  performance  com¬ 
pared  with  the  performance  of  other  athletes.  He  is  a  good  runner  and  interested  in 
his  100  m  results  compared  with  those  of  other  athletes. 

(a)  He  uses  the  decathlon  data  from  this  book  (Appendix  A. 2)  to  come  up  with 
a  =  s  =  0.233.  What  sample  size  does  he  need  to  calculate  a  95  %  confidence 
interval  for  the  mean  running  time  which  is  precise  to  ±0.1  s? 

(b)  Calculate  a  95  %  confidence  interval  for  the  mean  running  time  (x  =  10.93)  of 
the  30  athletes  captured  in  the  data  set  in  Chap.  A. 2.  Interpret  the  width  of  this 
interval  compared  with  the  width  determined  in  a). 

(c)  The  runner’s  own  best  time  is  10.86  s.  He  wants  to  be  among  the  best  10  %  of  all 
athletes.  Calculate  an  appropriate  confidence  interval  to  compare  his  time  with 
the  10%  best  times. 

Exercise  9.11  Consider  the  pizza  delivery  data  described  in  Chap.  A.4.  We  distin¬ 
guish  between  pizzas  delivered  on  time  (i.e.  in  less  than  30  min)  and  not  delivered  on 
time  (i.e.  in  more  than  30  min).  The  contingency  table  for  delivery  time  and  operator 
looks  as  follows: 


Operator 

Total 

Laura 

Melissa 

<30  min 

163 

151 

314 

>30  min 

475 

477 

952 

Total 

638 

628 

1266 

(a)  Calculate  and  interpret  the  odds  ratio  and  its  95  %  confidence  interval. 

(b)  Reproduce  the  results  from  (a)  using  R. 


Solutions  to  all  exercises  in  this  chapter  can  be  found  on  p.  384 


Hypothesis  Testing 


10.1  Introduction 

We  introduced  point  and  interval  estimation  of  parameters  in  the  previous  chapter. 
Sometimes,  the  research  question  is  less  ambitious  in  the  sense  that  we  are  not  inter¬ 
ested  in  precise  estimates  of  a  parameter,  but  we  only  want  to  examine  whether 
a  statement  about  a  parameter  of  interest  or  the  research  hypothesis  is  true  or  not 
(although  we  will  see  later  in  this  chapter  that  there  is  a  connection  between  confi¬ 
dence  intervals  and  statistical  tests,  called  duality).  Another  related  issue  is  that  once 
an  analyst  estimates  the  parameters  on  the  basis  of  a  random  sample,  (s)he  would 
like  to  infer  something  about  the  value  of  the  parameter  in  the  population.  Statisti¬ 
cal  hypothesis  tests  facilitate  the  comparison  of  estimated  values  with  hypothetical 
values. 

Example  10.1.1  As  a  simple  example,  consider  the  case  where  we  want  to  find  out 
whether  the  proportion  of  votes  for  a  party  P  in  an  election  will  exceed  30  %  or 
not.  Typically,  before  the  election,  we  will  try  to  get  representative  data  about  the 
election  proportions  for  different  parties  (e.g.  by  telephone  interviews)  and  then  make 
a  statement  like  “yes”,  we  expect  that  P  will  get  more  than  30  %  of  the  votes  or  “no”, 
we  do  not  have  enough  evidence  that  P  will  get  more  than  30  %  of  the  votes.  In 
such  a  case,  we  will  only  know  after  the  election  whether  our  statement  was  right  or 
wrong.  Note  that  the  term  representative  data  only  means  that  the  sample  is  similar 
to  the  population  with  respect  to  the  distributions  of  some  key  variables,  e.g.  age, 
gender,  and  education.  Since  we  use  one  sample  to  compare  it  with  a  fixed  value 
(30  %),  we  call  it  a  one-sample  problem. 

Example  10.1.2  Consider  another  example  in  which  a  clinical  study  is  conducted 
to  compare  the  effectiveness  of  a  new  drug  (B)  to  an  established  standard  drug  (A) 
for  a  specific  disease,  for  example  too  high  blood  pressure.  Assume  that,  as  a  first 
step,  we  want  to  find  out  whether  the  new  drug  causes  a  higher  reduction  in  blood 
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pressure  than  the  already  established  older  drug.  A  frequently  used  study  design  for 
this  question  is  a  randomized  (i.e.  patients  are  randomly  allocated  to  one  of  the  two 
treatments)  controlled  clinical  trial  (double  blinded,  i.e.  neither  the  patient  nor  the 
doctor  know  which  of  the  drugs  a  patient  is  receiving  during  the  trial),  conducted  in 
a  fixed  time  interval,  say  3  months.  A  possible  hypothesis  is  that  the  average  change 
in  the  blood  pressure  in  group  B  is  higher  than  in  group  A,  i.e.  5b  >  5a  where 
Sj  =  /a jo  —  fij 3,  j  =  A,  B  and  fijo  is  the  average  blood  pressure  at  baseline  before 
measuring  the  blood  pressure  again  after  3  months  {(ij 3).  Note  that  we  expect  both 
the  differences  5  a  and  5b  to  be  positive,  since  otherwise  we  would  have  some  doubt 
that  either  drug  is  effective  at  all.  As  a  second  step  (after  statistically  proving  our 
hypothesis),  we  are  interested  in  whether  the  improvement  of  B  compared  to  A  is 
relevant  in  a  medical  or  biological  sense  and  is  valid  for  the  entire  population  or 
not.  This  will  lead  us  again  to  the  estimation  problems  of  the  previous  chapter,  i.e. 
quantifying  an  effect  using  point  and  interval  estimation.  Since  we  are  comparing  two 
drugs,  we  need  to  have  two  samples  from  each  of  the  drugs;  hence,  we  have  a  two- 
sample  problem.  Since  the  patients  receiving  A  are  different  from  those  receiving 
B  in  this  example,  we  refer  to  it  as  a  “two-independent-samp\es  problem”. 

Example  10.1.3  In  another  example,  we  consider  an  experiment  in  which  a  group 
of  students  receives  extra  mathematical  tuition.  Their  ability  to  solve  mathematical 
problems  is  evaluated  before  and  after  the  extra  tuition.  We  are  interested  in  knowing 
whether  the  ability  to  solve  mathematical  problems  increases  after  the  tuition,  or  not. 
Since  the  same  group  of  students  is  used  in  a  pre-post  experiment,  this  is  called  a 
“two-dependent- samples  problem”  or  a  “paired  data  problem”. 


1 0.2  Basic  Definitions 
1 0.2.1  One-  and  Two-Sample  Problems 

In  one-sample  problems,  the  data  is  usually  assumed  to  arise  as  one  sample  from  a 
defined  population.  In  two-sample  problems,  the  data  originates  in  the  form  of  two 
samples  possibly  from  two  different  populations.  The  heterogeneity  is  often  modelled 
by  assuming  that  the  two  populations  only  differ  in  some  parameters  or  key  quantities 
such  as  expectation  (i.e.  mean),  median,  or  variance.  As  in  our  introductory  example, 
the  samples  can  either  be  independent  (as  in  the  drug  Example  10.1.2)  or  dependent 
(as  in  the  evaluation  Example  10.1.3). 


10.2.2  Hypotheses 

A  researcher  may  have  a  research  question  for  which  the  truth  about  the  population 
of  interest  is  unknown.  Suppose  data  can  be  obtained  using  a  survey,  observation,  or 
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an  experiment:  if,  given  a  prespecified  uncertainty  level,  a  statistical  test  based  on 
the  data  supports  the  hypothesis  about  the  population,  we  say  that  this  hypothesis  is 
statistically  proven.  Note  that  the  research  question  has  to  be  operationalized  before 
it  can  be  tested  by  a  statistical  test.  Consider  the  drug  Example  10.1.2:  we  want  to 
examine  whether  the  new  drug  B  has  a  greater  blood  pressure  lowering  effect  than 
the  standard  drug  A.  We  have  several  options  to  operationalize  this  research  question 
into  a  statistical  set-up.  One  is  to  test  whether  the  average  reduction  (from  baseline  to 
3  months)  of  the  blood  pressure  is  higher  (and  positive)  for  drug  B  than  drug  A.  We 
then  state  our  hypotheses  in  terms  of  expected  values  (i.e.  (i).  Why  do  we  have  to  use 
the  expected  values  fi  and  not  simply  compare  the  arithmetic  means  v  ?  The  reason  is 
that  the  superiority  of  B  shown  in  the  sample  will  only  be  valid  for  this  sample  and 
not  necessarily  for  another  sample.  We  need  to  show  the  superiority  of  B  in  the  entire 
population,  and  hence,  our  hypothesis  needs  to  reflect  this.  Another  option  would 
be,  for  example,  to  use  median  changes  in  blood  pressure  values  instead  of  mean 
changes  in  blood  pressure  values.  An  important  point  is  that  the  research  hypothesis 
which  we  want  to  prove  has  to  be  formulated  as  the  statistical  alternative  hypothesis, 
often  denoted  by  H\ .  The  reason  for  this  will  become  clearer  later  in  this  chapter. 
The  opposite  of  the  research  hypothesis  has  to  be  formulated  as  the  statistical  null 
hypothesis,  denoted  by  Ho.  In  the  drug  example,  the  alternative  and  null  hypotheses 
are,  respectively, 

H\  :  5b  > 


and 


Ho  :  5b  <  5a. 


We  note  that  the  two  hypotheses  are  disjoint  and  the  union  of  them  covers  all  possible 
differences  of  5  b  and  5a-  There  is  a  boundary  value  (5b  =  5a)  which  separates  the 
two  hypotheses.  Since  we  want  to  show  the  superiority  of  B ,  the  hypothesis  was 
formulated  as  a  one-sided  hypothesis.  Note  that  there  are  different  ways  to  formulate 
two- sample  hypotheses;  for  example,  H\  :5b  >  5a  is  equivalent  to  H\  :  5b  —  5a  > 
0.  In  fact,  it  is  very  common  to  formulate  two- sample  hypotheses  as  differences, 
which  we  will  see  later  in  this  chapter. 


1 0.2.3  One-  and  Two-Sided  Tests 


We  distinguish  between  one-sided  and  two-sided  hypotheses  and  tests.  In  the  previous 
section,  we  gave  an  example  of  a  one-sided  test. 

For  an  unknown  population  parameter  0  (e.g.  (i)  and  a  fixed  value  Oo  (e.g.  5),  the 
following  three  cases  have  to  be  distinguished: 


Case 

Null  hypothesis 

Alternative  hypothesis 

(a) 

o 

II 

0  ^  Oo 

Two-sided  test  problem 

(b) 

0  >  Oo 

0  <  Oo 

One-sided  test  problem 

(c) 

0  <  Oq 

0  >  Oq 

One-sided  test  problem 
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Example  10.2.1  One-sample  problems  often  test  whether  a  target  value  is  achieved 
or  not.  For  example,  consider  the  null  hypothesis  as 

•  Hq  :  average  filling  weight  of  packages  of  flour  =  1  kg 

•  Ho  :  average  body  height  (men)  =  178  cm. 

The  alternative  hypothesis  H\  is  formulated  as  deviation  from  the  target  value.  If 
deviations  in  both  directions  are  interesting,  then  H\  is  formulated  as  a  two-sided 
hypothesis, 

•  H\  :  average  body  height  (men)  ^  178  cm. 

If  deviations  in  a  specific  direction  are  the  subject  of  interest,  then  Hi  is  formulated 
as  a  one-sided  hypothesis,  for  example, 

•  H\  :  average  filling  weight  of  flour  packages  is  lower  than  1  kg. 

•  H\  :  average  filling  weight  of  flour  packages  is  greater  than  1  kg. 

Two-sample  problems  often  examine  differences  of  two  samples.  Suppose  the 
null  hypothesis  Ho  is  related  to  the  average  weight  of  flour  packages  filled  by  two 
machines,  say  1  and  2.  Then,  the  null  hypothesis  is 

•  Ho  :  average  weight  of  flour  packages  filled  by  machine  1  =  average  weight  of 
flour  packages  filled  by  machine  2. 

Then,  H\  can  be  formulated  as  a  one-sided  or  two-sided  hypothesis.  If  we  want  to 
prove  that  machine  1  and  machine  2  have  different  filling  weights,  then  H\  would 
be  formulated  as  a  two-sided  hypothesis 

•  H\  :  average  filling  weight  of  machine  1  ^  average  filling  weight  of  machine  2. 

If  we  want  to  prove  that  machine  1  has  lower  average  filling  weight  than  machine  2, 
H\  would  be  formulated  as  a  one-sided  hypothesis 

•  Hi  :  average  filling  weight  of  machine  1  <  average  filling  weight  of  machine  2. 

If  we  want  to  prove  that  machine  2  has  lower  filling  weight  than  machine  1,  Hi 
would  be  formulated  as  a  one-sided  hypothesis 

•  Hi  :  average  filling  weight  of  machine  1  >  average  filling  weight  of  machine  2. 

Remark  10.2.1  Note  that  we  have  not  considered  the  following  situation:  Ho  :  0  / 
#o,  H\  :  0  =  6o.  In  general,  with  the  tests  described  in  this  chapter,  we  cannot  prove 
the  equality  of  a  parameter  to  a  predefined  value  and  neither  can  we  prove  the 
equality  of  two  parameters,  as  in  Ho  :0 1  ^  O2,  Hi  :  61  =  O2.  We  can,  for  example, 
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not  prove  (statistically)  that  machines  1  and  2  in  the  previous  example  provide  equal 
filling  weight.  This  would  lead  to  the  more  complex  class  of  equivalence  tests,  which 
is  a  topic  beyond  the  scope  of  this  book. 


1 0.2.4  Type  I  and  Type  II  Error 

If  we  undertake  a  statistical  test,  two  types  of  error  can  occur. 

•  The  hypothesis  Ho  is  true  but  is  rejected;  this  error  is  called  type  I  error. 

•  The  hypothesis  Ho  is  not  rejected  although  it  is  wrong;  this  is  called  type  II  error. 

When  a  hypothesis  is  tested,  then  the  following  four  situations  are  possible: 


Hq  is  true  Hq  is  not  true 

Ho  is  not  rejected 
Hq  is  rejected 


Correct  decision  Type  II  error 
Type  I  error  Correct  decision 


The  significance  level  is  the  probability  of  type  I  error,  P(H\\Ho)  =  a ,  which  is  the 
probability  of  rejecting  Ho  (accepting  H\)  if  Ho  is  true.  If  we  construct  a  test,  the 
significance  level  a  is  prespecified,  e.g.  a  =  0.05.  A  significance  test  is  constructed 
such  that  the  probability  of  a  type  I  error  does  not  exceed  a  while  the  probability  of  a 
type  II  error  depends  on  the  true  but  unknown  parameter  values  in  the  population(s) 
and  the  sample  size.  Therefore,  the  two  errors  are  not  symmetrically  treated  in  a 
significance  test.  In  fact,  the  type  II  error  (3 ,  P(Ho\H\)  =  (3  is  not  controlled  by 
the  construction  of  the  test  and  can  become  very  high,  sometimes  up  to  l  —  a. 
This  is  the  reason  why  a  test  not  rejecting  Ho  is  not  a  (statistical)  proof  of  Ho. 
In  mathematical  statistics,  one  searches  for  the  best  test  which  maintains  a  and 
minimizes  /3.  Minimization  of  both  a  and  /3  simultaneously  is  not  possible.  The 
reason  is  that  when  a  increases  then  j3  decreases  and  vice  versa.  So  one  of  the  errors 
needs  to  be  fixed  and  the  other  error  is  minimized.  Consequently,  the  error  which 
is  considered  more  serious  is  fixed  and  then  the  other  error  is  minimized.  The  tests 
discussed  in  the  below  sections  are  obtained  based  on  the  assumption  that  the  type  I 
error  is  more  serious  than  the  type  II  error.  So  the  test  statistics  are  obtained  by  fixing 
a  and  then  minimizing  /3.  In  fact,  the  null  hypothesis  is  framed  in  such  a  way  that 
it  implies  that  the  type  I  error  is  more  serious  than  the  type  II  error.  The  probability 
1  —  (3  =  P(H\\H\)  is  called  the  power  of  the  test.  It  is  the  probability  of  making  a 
decision  in  favour  of  the  research  hypothesis  H\,  if  it  is  true,  i.e.  the  probability  of 
detecting  a  correct  research  hypothesis. 
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1 0.2.5  How  to  Conduct  a  Statistical  Test 

In  general,  we  can  follow  the  steps  described  below  to  test  a  hypothesis  about  a 

population  parameter  based  on  a  sample  of  data. 

(1)  Define  the  distributional  assumptions  for  the  random  variables  of  interest,  and 
specify  them  in  terms  of  population  parameters  (e.g.  0  or  /i  and  a).  This  is  neces¬ 
sary  for  parametric  tests.  There  are  other  types  of  tests,  so-called  nonparametric 
tests,  where  the  assumptions  can  be  relaxed  in  the  sense  that  we  do  not  have  to 
specify  a  particular  distribution,  see  Sect.  10.6ff.  Moreover,  for  some  tests  the 
distributional  assumptions  can  be  relaxed  if  the  sample  size  is  large. 

(2)  Formulate  the  null  hypothesis  and  the  alternative  hypothesis  as  described  in 
Sects.  10.2.2  and  10.2.3. 

(3)  Fix  a  significance  value  (often  called  type  I  error)  a ,  for  example  a  =  0.05,  see 
also  Sect.  10.2.4. 

(4)  Construct  a  test  statistic  T (X)  =  T(X\,  X2,  . . . ,  Xn).  The  distribution  of  T  has 
to  be  known  under  the  null  hypothesis  Ho.  We  note  again  that  (X\,  X2, 

Xn)  refers  to  the  random  variables  before  drawing  the  actual  sample  and 
x\ ,  X2,  •  •  • ,  xn  are  the  realized  values  (observations)  in  the  sample. 

(5)  Construct  a  critical  region  K  for  the  statistic  T ,  i.e.  a  region  where — if  T  falls 
in  this  region — Ho  is  rejected,  such  that 

PHo(T(X)eK)<a. 

The  notation  Ph0  (•)  means  that  this  inequality  must  hold  for  all  parameter  values 
0  that  belong  to  the  null  hypothesis  Ho.  Since  we  assume  that  we  know  the 
distribution  of  T  (X)  under  Ho ,  the  critical  region  is  defined  by  those  values  of 
T  (X)  which  are  unlikely  (i.e.  with  probability  of  less  than  a)  to  be  observed 
under  the  null  hypothesis.  Note  that  although  T  (X)  is  a  random  variable,  K  is  a 
well-defined  region,  see  Fig.  10.1  for  an  example. 

(6)  Calculate  t(x)  =  T(x  1,  X2,  . . . ,  xn)  based  on  the  realized  sample  values  X\  = 

X\,  X2  —  X2 ,  •  •  •  ,  Xn  —  Xyi  . 

(7)  Decision  rule:  if  t(x)  falls  into  the  critical  region  K ,  the  null  hypothesis  Ho 
is  rejected.  The  alternative  hypothesis  is  then  statistically  proven.  If  t(x)  falls 
outside  the  critical  region,  Ho  is  not  rejected. 

t{x)  e  K  :  Ho  rejected  =>  H\  is  statistically  significant, 
t(x)  £  K  :  Ho  not  rejected  and  therefore  accepted. 

The  next  two  paragraphs  show  how  to  arrive  at  the  test  decisions  from  step  7  in 
a  different  way.  Readers  interested  in  an  example  of  a  statistical  test  may  jump 
to  Sect.  10.3.1  and  possibly  also  Example  10.3.1. 
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1 0.2.6  Test  Decisions  Using  the  p-\falue 

Statistical  software  usually  does  not  show  us  all  the  steps  of  hypothesis  testing  as 
outlined  in  Sect.  10.2.5.  It  is  common  that  instead  of  calculating  and  reporting  the 
critical  values,  the  test  statistic  is  printed  together  with  the  so-called  p-value.  It  is 
possible  to  use  the  p- value  instead  of  critical  regions  for  making  test  decisions.  The 
p- value  of  the  test  statistic  T  (X)  is  defined  as  follows: 

two-sided  case:  Ph0(\T\  >  t(x ))  =  p-w alue 
one-sided  case:  Ph0(T  >  t(x ))  =  p-w  alue 

Ph0(T  <  t(x))  =  p- value 

It  can  be  interpreted  as  the  probability  of  observing  results  equal  to,  or  more  extreme 
than  those  actually  observed  if  the  null  hypothesis  was  true.  Then,  the  decision  rule 
is 

Ho  is  rejected  if  the  p-w  alue  is  smaller  than  the  prespecified  significance  level  a. 

Otherwise,  Ho  cannot  be  rejected. 

Example  10.2.2  Assume  that  we  are  dealing  with  a  two-sided  test  and  assume  further 
that  the  test  statistic  T(x)  is  N(0,  1) -distributed  under  Ho.  The  significance  level  is 
a  =  0.05.  If  we  observe,  for  example,  t  =  3,  then  the  p-value  is  Ph0(\T\  >  3).  This 
can  be  calculated  in  R  as 

2* (l-pnorm(3) ) 

because  pnormO  is  used  to  calculate  P(X  <  x),  and  therefore,  1-pnormO  can  be 
used  to  calculate  P(X  >  x).  We  have  to  multiply  with  two  because  we  are  deal¬ 
ing  with  a  two-sided  hypothesis.  The  result  is  p  =  0.002699796.  Therefore,  Ho  is 
rejected.  The  one-sided  p-w  alue  is  half  of  the  two-sided  p-w  alue,  i.e.  P(T  >  3)  = 
P(T  <  3)  =  0.001349898,  and  is  not  necessarily  reported  by  R.  It  is  therefore  impor¬ 
tant  to  look  carefully  at  the  R  output  when  dealing  with  one-sided  hypotheses. 

The  p-value  is  sometimes  also  called  the  significance ,  although  we  prefer  the  term 
p -value.  We  use  the  term  significance  only  in  the  context  of  a  test  result:  a  test  is 
(statistically)  significant  if  (and  only  if)  Ho  can  be  rejected. 

Unfortunately,  the  p -value  is  often  over-interpreted:  both  a  test  and  the  p -value 
can  only  provide  a  yes/no  decision:  either  Ho  is  rejected  or  not.  Interpreting  the  p- 
value  as  the  probability  that  the  null  hypothesis  is  true  is  wrong!  It  is  also  incorrect  to 
say  that  the  p-value  is  the  probability  of  making  an  error  during  the  test  decision.  In 
our  (frequentist)  context,  hypotheses  are  true  or  false  and  no  probability  is  assigned 
to  them.  It  can  also  be  misleading  to  speak  of  “highly  significant”  results  if  the  p- 
value  is  very  small.  A  last  remark:  the  p -value  itself  is  a  random  variable:  under  the 
null  hypothesis,  it  follows  a  uniform  distribution,  i.e.  p  ~  U( 0,  1). 
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1 0.2.7  Test  Decisions  Using  Confidence  Intervals 

There  is  an  interesting  and  useful  relationship  between  confidence  intervals  and 
hypothesis  tests.  If  the  null  hypothesis  Ho  is  rejected  at  the  significance  level  a ,  then 
there  exists  a  100(1  —  a)  %  confidence  interval  which  yields  the  same  conclusion  as 
the  test:  if  the  appropriate  confidence  interval  does  not  contain  the  value  ho  targeted 
in  the  hypothesis,  then  Ho  is  rejected.  We  call  this  duality.  For  example,  recall 
Example  10.1.2  where  we  were  interested  in  whether  the  average  change  in  blood 
pressure  for  drug  B  is  higher  than  for  drug  A,  i.e.  H\  :5b  >  -  This  hypothesis  is 

equivalent  to  H\  :  8b  —  8  a  >  So  =  0.  In  the  following  section,  we  develop  tests  to 
decide  whether  H\  is  statistically  significant  or  not.  Alternatively,  we  could  construct 
a  100(1  —  a)  %  confidence  interval  for  the  difference  8b  —  8  a  and  evaluate  whether 
the  interval  contains  5o  —  0  or  not;  if  yes,  we  accept  Ho ;  otherwise,  we  reject  it.  For 
some  of  the  tests  introduced  in  following  section,  we  refer  to  the  confidence  intervals 
which  lead  to  the  same  results  as  the  corresponding  test. 


1 0.3  Parametric  Tests  for  Location  Parameters 

1 0.3.1  Test  for  the  Mean  When  the  Variance  is  Known  (One-Sample 
Gauss  Test) 

We  develop  a  hypothesis  test  to  test  whether  the  unknown  mean  (expectation)  p  of  a 
N(p,  a2) -distributed  random  variable  X  either  differs  from  a  specific  value  p  =  po 
or  is  smaller  (or  greater)  than  po.  We  assume  that  the  variance  a2  =  <Tq  is  known. 
We  apply  the  scheme  of  Sect.  10.2.5  step  by  step  to  develop  the  test  procedure  and 
then  give  an  illustrative  example. 

1.  Distributional  assumption :  The  random  variable  X  follows  a  N(p,  ctq)-  distribu¬ 
tion  with  known  variance  <7q.  We  assume  that  an  i.i.d.  random  sample  is  drawn  from 
X\,  X2,  . . . ,  Xn  where  the  X;s  follow  the  same  distribution  as  X,  /  =  1,  2,  . . . ,  n. 

2.  Define  any  of  the  following  set  of  hypotheses  Ho  and  H\  \ 


Ho  :  p  =  po  versus  H\  :  p  7^  po,  (two-sided  test) 

Ho  :  p  <  po  versus  H\  :  p  >  po,  (one-sided  test) 

Ho'.  p>  po  versus  H\  :  p  <  po,  (one-sided  test). 


3.  Specify  the  probability  of  a  type  I  error  a:  Often  a  =  0.05  =  5  %  is  chosen. 

4.  Construct  a  test  statistic :  The  unknown  mean,  i.e.  the  expectation  p,  is  usually 
estimated  by  the  sample  mean  v.  We  already  know  that  if  the  X/s  are  i.i.d.,  then  the 
sample  mean  is  normally  distributed.  Under  the  assumption  that  Ho  is  true, 


i=\ 
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where  ~  means  the  “distribution  under  Hq”.  If  we  standardize  the  mean  under  Hq, 
we  get  aiV(0,  1) -distributed  test  statistic 

X  —  /in  / —  Hq 

T (X)  = - —Jn  ~  N( 0,  1), 

cr 0 

see  also  Theorem  7.3.2.  Note  that  T (X)  follows  a  normal  distribution  even  if  the 
Xi  s  are  not  normally  distributed  and  if  n  is  large  enough  which  follows  from  the 
Central  Limit  Theorem  (Appendix  C.3).  One  can  conclude  that  the  distributional 
assumption  from  step  1  is  thus  particularly  important  for  small  samples,  but  not 
necessarily  important  for  large  samples.  As  a  rule  of  thumb,  n  >  30  is  considered 
to  be  a  large  sample.  This  rule  is  based  on  the  knowledge  that  a  t -distribution  with 
more  than  30  degrees  of  freedom  gets  very  close  to  a  N(0,  1) -distribution. 

5.  Critical  region :  Since  the  test  statistic  T(X)  is  N(0,  1) -distributed,  we  get  the 
following  critical  regions,  depending  on  the  hypothesis: 


Case 

Ho 

Hi 

Critical  region  K 

(a) 

(b) 

(c) 

IV  IA  II 

o  o  o 

n  7^  /A 0 
n  >  no 

n  <  no 

K  =  (—OO,  —Zl-ct/2)  U  (zi-a/2,  OO) 
K  =  (zi-a,  oo) 

K  =  (-OO,  =  -Zl-a) 

For  case  (a)  with  Ho:  fi  =  Ho  and  H\\  i±  /  /io,  we  are  interested  in  extreme 
values  of  the  test  statistic  on  both  tails:  very  small  values  and  very  large  values  of  the 
test  statistic  give  us  evidence  that  Hq  is  wrong  (because  the  statistic  is  mainly  driven 
by  the  difference  of  the  sample  mean  and  the  test  value  ho  for  a  fixed  variance), 
see  Fig.  10.1.  In  such  a  two-sided  test,  when  the  distribution  of  the  test  statistic  is 
symmetric,  we  divide  the  critical  region  into  two  equal  parts  and  assign  each  region 
of  size  a/ 2  to  the  left  and  right  tails  of  the  distribution.  For  a  =  0.05,  2.5  %  of  the 
most  extreme  values  towards  the  right  end  of  the  distribution  and  2.5  %  of  the  most 
extreme  values  towards  the  left  end  of  the  distribution  give  us  enough  evidence  that 
Hq  is  wrong  and  can  be  rejected  and  that  H\  is  accepted.  It  is  also  clear  why  a  is 


Fig.  10.1  Critical  region  of  a  two-sided  one-sample  Gauss-test  Ho :  n  =  no  versus  H\  \  n  ^  no. 
The  critical  region  K  —  (— oo,  —z\-a/2)  U  (z\-a/2,  oo)  has  probability  mass  a  if  Hq  is  true* 
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Fig.  10.2  Critical  region  of  a  one-sided  one-sample  Gauss  test  H$\  fi  <  /iq  versus  H\\  fi  >  hq. 
The  critical  region  K  =  (z\-a,  °°)  has  probability  mass  a  if  Hq  is  true* 


the  probability  of  a  type  I  error:  the  most  extreme  values  in  the  two  tails  together 
have  5  %  probability  and  are  just  the  probability  that  the  test  statistic  falls  into  the 
critical  region  although  Ho  is  true.  Also,  these  areas  are  those  which  have  the  least 
probability  of  occurring  if  Ho  is  true.  For  a  =  0.05,  we  get  zi_«  =  1.96. 

For  case  (b),  only  one  direction  is  of  interest.  The  critical  region  lies  on  the  right 
tail  of  the  distribution  of  the  test  statistic.  A  very  large  value  of  the  test  statistic  has 
a  low  probability  of  occurrence  if  Ho  is  true.  An  illustration  is  given  in  Fig.  10.2:  for 
a  =  0.05,  we  get  z i-a  =  1-64  and  any  values  greater  than  1.64  are  unlikely  to  be 
observed  under  Ho.  Analogously,  the  critical  region  for  case  (c)  is  constructed.  Here, 
the  shaded  area  (critical  region)  is  on  the  left-hand  side.  In  this  case,  for  a  =  0.05, 
we  get  z«  =  -zi-a  =  -1.64. 


6.  Realization  of  the  test  statistic :  For  an  observed  sample  x\,  X2,  . . . ,  xn, 
metic  mean 


v  = 


1 

n 


Xi 


i= 1 


the  arith- 


is  used  to  calculate  the  realized  (observed)  test  statistic  t(x)  =  T(x\,  xz,. . . ,  xn)  as 

t(x)  =  - ^/n. 

cro 

7.  Test  decision :  If  the  realized  test  statistic  from  step  6  falls  into  the  critical  region, 
Ho  is  rejected  (and  therefore,  H\  is  statistically  proven).  Table  10.1  summarizes  the 
test  decisions  depending  on  t  (v)  and  the  quantiles  defining  the  appropriate  critical 
regions. 


Example  10.3.1  A  bakery  supplies  loaves  of  bread  to  supermarkets.  The  stated  sell¬ 
ing  weight  (and  therefore  the  required  minimum  expected  weight)  is  p  =  2  kg. 
However,  not  every  package  weighs  exactly  2  kg  because  there  is  variability  in  the 
weights.  It  is  therefore  important  to  find  out  if  the  average  weight  of  the  loaves 
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Table  1 0.1  Rules  to  make  test  decisions  for  the  one-sample  Gauss  test  (and  the  two-sample  Gauss 
test,  the  one-sample  approximate  binomial  test,  and  the  two-sample  approximate  binomial  test — 
which  are  all  discussed  later  in  this  chapter) 


Case 

Ho 

Hi 

Reject  Ho  if 

(a) 

o 

II 

=1 

P  7^  P0 

\t(x)\  >  Zl-a/2 

(b) 

IV 

o 

A 

o 

t(x)  <  Za 

(c) 

o 

=3. 

VI 

=3. 

V 

o 

t(x)  >  Zl-a 

is  significantly  smaller  than  2  kg.  The  weight  X  (measured  in  kg)  of  the  loaves  is 
assumed  to  be  normally  distributed.  We  assume  that  the  variance  <7q  =  0. 12  is  known 
from  experience.  A  supermarket  draws  a  sample  of  n  =  20  loaves  and  weighs  them. 
The  average  weight  is  calculated  as  v  =  1.97  kg.  Since  the  supermarket  wants  to  be 
sure  that  the  weights  are,  on  average,  not  lower  than  2  kg,  a  one-sided  hypothesis 
is  appropriate  and  is  formulated  as  Hq:  p  >  po  =  2  kg  versus  H\ :  p  <  /ip  =  2  kg. 
The  significance  level  is  specified  as  a  =  0.05,  and  therefore,  z i-a  =  1-64.  The  test 
statistic  is  calculated  as 

t(x)  = - —Vn  = - V20  =  -1.34. 

(70  0.1 

The  null  hypothesis  is  not  rejected,  since  t(x )  =  —1.34  >  —1.64  =  — zi-o.05  = 
£0.05- 

Interpretation :  The  sample  average  v  =  1 .97  kg  is  below  the  target  value  of  p  =  2  kg. 
But  there  is  not  enough  evidence  to  reject  the  hypothesis  that  the  sample  comes  from 
a  A (2,  0.12)-distributed  population.  The  probability  to  observe  a  sample  of  size 
n  =  20  with  an  average  of  at  most  1.97  in  a  N( 2,  0.12)-distributed  population  is 
greater  than  a  =  0.05  =  5  %.  The  difference  between  v  =  1.97  kg  and  the  target 
value  p  =  2  kg  is  not  statistically  significant. 

Remark  10.3.1  The  Gauss  test  assumes  the  variance  to  be  known,  which  is  often 
not  the  case  in  practice.  The  Gtest  (Sect.  10.3.2)  assumes  that  the  variance  needs  to 
be  estimated.  The  Gtest  is  therefore  commonly  employed  when  testing  hypotheses 
about  the  mean.  Its  usage  is  outlined  below.  In  R ,  the  command  Gauss  .test  from 
the  library  compositions  offers  an  implementation  of  the  Gauss  test. 


1 0.3.2  Test  for  the  Mean  When  the  Variance  is  Unknown 
(One-Sample  J-Test) 

If  the  variance  a2  is  unknown,  hypotheses  about  the  mean  p  of  a  normal  random 
variable  X  ~  N(p,  a2)  can  be  tested  in  a  similar  way  to  the  one-sample  Gauss 
test.  The  difference  is  that  the  unknown  variance  is  estimated  from  the  sample.  An 
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unbiased  estimator  of  a2  is  the  sample  variance 


1 


X(v  -  *)2- 

i= 1 


The  test  statistic  is  therefore 


T(X)  = 


which  follows  a  t -distribution  with  n  —  1  degrees  of  freedom  if  Ho  is  true,  as  we 
know  from  Theorem  8.3.2. 


Critical  regions  and  test  decisions 

Since  T  (X)  follows  a  t -distribution  under  Ho ,  the  critical  regions  refer  to  the  regions 
of  the  t  -distribution  which  are  unlikely  to  be  observed  under  Ho  '. 


Case 

H0 

Hi 

Critical  region  K 

(a) 

(b) 

(c) 

IA  IV  II 

*£  *£  *£ 
000 

P  7^  Po 
P  <  po 

P  >  Po 

E  —  (  OO,  tn  —  0\—a/ 2)  U  (tn—l;  1— a/2)  OO) 
E  —  (  OO,  tn—  1;  1  —  a) 

E  —  —  OO) 

The  hypothesis  Ho  is  rejected  if  the  realized  test  statistic,  i.e. 


x  -  fJLQ 
Sx 


5 


falls  into  the  critical  region.  The  critical  regions  are  based  on  the  appropriate  quantiles 
of  the  t -distribution  with  (n  —  1)  degrees  of  freedom,  as  outlined  in  Table  10.2. 


Example  10.3.2  We  again  consider  Example  10.3.1.  Now  we  assume  that  the  vari¬ 
ance  of  the  loaves  is  unknown.  Suppose  a  random  sample  of  size  n  =  20  has  an 
arithmetic  mean  ofx  =  1.9668  and  a  sample  variance  of  s2  =  0.09272.  We  want  to 
test  whether  this  result  contradicts  the  two-sided  hypothesis  Ho  ',  p  =  2,  that  is  case 
(a).  The  significance  level  is  fixed  at  a  =  0.05.  For  the  realized  test  statistic  t(x),  we 
calculate 


1.9668  -2  r- 

- V20 

0.0927 


-1.60. 


Table  1 0.2  Rules  to  make  test  decisions  for  the  one-sample  t-test  (and  the  two-sample  t-test,  and 
the  paired  t-test,  both  explained  below) 


Case 

Ho 

Hi 

Reject  Ho,  if 

(a) 

0 

=3. 

II 

=3. 

p  ^  po 

|r(x)|  >  tn- 1;1— a/2 

(b) 

'S: 

IV 

0 

A 

0 

e 

1 

7 

V 

5 

4** 

(c) 

0 

VI 

=3. 

V 

0 

t(v)  >  tn—  1 ;  1  — Q; 
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Ho  is  not  rejected  since  \t\  =  1.60  <  2.09  =  t\g- 0.975,  where  the  quantiles  ±2.09 
are  defining  the  critical  region  (see  Table  C.2  or  use  R :  qt  (0.975, 19)).  The  same 
results  can  be  obtained  in  R  using  the  t .  test  ()  function,  see  Example  10.3.3  for 
more  details.  Or,  we  can  directly  calculate  the  (two-sided)  p -value  as 

2* (1-pt (abs (1 . 6)  ,  df=19))  |£^ 

This  yields  a  p-v alue  of  0.1260951  which  is  not  smaller  than  a,  and  therefore, 
Ho  is  not  rejected. 


10.3.3  Comparing  the  Means  of  Two  Independent  Samples 

In  a  two-sample  problem,  we  may  be  interested  in  comparing  the  means  of  two  inde¬ 
pendent  samples.  Assume  that  we  have  two  samples  of  two  normally  distributed  vari¬ 
ables  X  ~  N(p.x,  cr\)  and  Y  ~  N(pJy,  cry)  of  size  n\  and  n2,  i.e.  X\,  X2,  . . . ,  Xni 
are  i.i.d.  with  the  same  distribution  as  X  and  Y\ ,  Y2,  . . . ,  Yni  are  i.i.d.  with  the  same 
distribution  as  Y.  We  can  specify  the  following  hypotheses: 


Case 

Null  hypothesis 

Alternative  hypothesis 

(a) 

(b) 

(c) 

f^X  =  lJY 
f^X  > 

Mx  <  Mr 

Mx  ±  Mr 

Mx  <  Mr 

Mx  >  Mr 

Two-sided  test  problem 
One-sided  test  problem 
One-sided  test  problem 

We  distinguish  another  three  cases: 

1 .  cr\  and  dy  are  known. 

T  9  9  9 

2.  crj  and  cry  are  unknown,  but  they  are  assumed  to  be  equal,  i.e.  crj  =  cry. 

3.  Both  g\  and  o\  are  unknown  and  unequal  ( a\  /  cry). 

Case  1:  The  variances  are  known  (two-sample  Gauss  test). 

If  the  null  hypothesis  Ho  :  fix  =  I^y  is  true,  then,  using  the  usual  rules  for  the  normal 
distribution  and  the  independence  of  the  samples, 


and 
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It  follows  that  the  test  statistic 


T(X,  Y) 


X-Y 


(10.1) 


follows  a  standard  normal  distribution,  T(X,  Y)  ~  N(0,  1).  The  realized  test  statistic 
is 


t(x ,  y ) 


x  -  y 


(10.2) 


The  test  procedure  is  identical  to  the  procedure  of  the  one-sample  Gauss  test  intro¬ 
duced  in  Sect.  10.3.1;  that  is,  the  test  decision  is  based  on  Table  10.1. 


Case  2:  The  variances  are  unknown,  but  equal  (two-sample  t-test). 

We  denote  the  unknown  variance  of  both  distributions  as  a2  (i.e.  both  the  populations 
are  assumed  to  have  variance  a2).  We  estimate  a2  by  using  the  pooled  sample  variance 
where  each  sample  is  assigned  weights  relative  to  the  sample  size: 


The  test  statistic 


■2  _  in i  ~  1)4  +  (W2  -  1  )Sj 
n  1+/12  —  2 


T  (X,  Y) 


X  —  Y  I  n\  •  ri2 


S  V  n\  +  n2 


(10.3) 

(10.4) 


with  S  as  in  (10.3)  follows  a  t -distribution  with  n\  +  ri2  —  2  degrees  of  freedom  if 
H0  is  true.  The  realized  test  statistic  is 


t(x,  y) 


x  —  y  /  n\  •  ri2 


n\  +  ri2 


(10.5) 


The  test  procedure  is  identical  to  the  procedure  of  the  one-sample  t-test;  that  is,  the 
test  decision  is  based  on  Table  10.2. 


Case  3:  The  variances  are  unknown  and  unequal  (Welch  test). 

We  test  Ho :  fix  =  Hy  versus  H\\  fix  /  I^y  given  a\  /  and  both  g\  and 
are  unknown.  This  problem  is  also  known  as  the  Behrens-Fisher  problem  and  is  the 
most  frequently  used  test  when  comparing  two  means  in  practice.  The  test  statistic 
can  be  written  as 


T  (X,  Y) 


(10.6) 


which  is  approximately  t -distributed  with  v  degrees  of  freedom: 


(10.7) 
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where  s2  and  s2  are  the  estimated  values  of  S\  =  ~  X)2  and  Sy  = 

^-j-  ^i=\  (Yt  —  T  )2,  respectively.  The  test  procedure,  using  the  observed  test  statistic 

\x  —  y\ 

t(x,y)  =  -±=^=,  (10.8) 

M  +  fi 

y  n\  '  ^2 

is  identical  to  the  procedure  of  the  one-sample  £-test;  that  is,  the  test  decision  is  based 
on  Table  10.2  except  that  the  degrees  of  freedom  are  not  n  —  1  but  v.  If  v  is  not  an 
integer,  it  can  be  rounded  off  to  an  integer  value. 

Example  10.3.3  A  small  bakery  sells  cookies  in  packages  of  500  g.  The  cookies  are 
handmade  and  the  packaging  is  either  done  by  the  baker  himself  or  his  wife.  Some 
customers  conjecture  that  the  wife  is  more  generous  than  the  baker.  One  customer 
does  an  experiment:  he  buys  packages  of  cookies  packed  by  the  baker  and  his  wife 
on  16  different  days  and  weighs  the  packages.  He  gets  the  following  two  samples 
(one  for  the  baker,  one  for  his  wife). 


Weight  (wife)  (X) 

512 

530 

498 

540 

521 

528 

505 

523 

Weight  (baker)  (Y) 

499 

500 

510 

495 

515 

503 

490 

511 

We  want  to  test  whether  the  complaint  of  the  customers  is  justified.  Let  us  start 
with  the  following  simple  hypotheses: 


Hq  :  fJLx  =  fiy  versus  H\  :  px  /  py, 


i.e.  we  only  want  to  test  whether  the  weights  are  different,  not  that  the  wife  is  making 
heavier  cookie  packages.  Since  the  variances  are  unknown,  we  assume  that  case  3  is 
the  right  choice.  We  calculate  and  obtain  v  =  519.625,  y  =  502.875,  =  192.268, 

and  =  73.554.  The  test  statistic  is: 


t(x,  y)  = 


\x  -  y  I 


si  +  si 


1519.625  -502.8751 

/ 192.268  +  73.554 


2.91 


n  i 


n  2 


8 


8 


The  degrees  of  freedom  are: 


/ 192.268  73.554\2  /(192.268/8)2  (73.554/8)2\ 

v  =  [ - 1- - I  /I  - + -  I  ^  11.67  ^  12. 

\  8  8/7y7  7  J 

Since  \  t(x)  \  =  2.91  >  2.18  =  fi2;o.975>  it  follows  that  Ho  is  rejected.  Therefore,  H\  is 
statistically  significant.  This  means  that  the  mean  weight  of  the  wife’s  packages  is 
different  from  the  mean  weight  of  the  baker’s  packages.  Let  us  refine  the  hypothesis 
and  try  to  find  out  whether  the  wife’s  packages  have  a  higher  mean  weight.  The 
hypotheses  are  now: 


Ho  :  px  <  l^y  versus  H\  :  /ix  >  py. 
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The  test  statistic  remains  the  same  but  the  critical  region  and  the  degrees  of  freedom 
change.  Thus,  Ho  is  rejected  if  t(x ,  y)  >  tv- \-a.  Using  tv- \-a  =  t i2;o.95  ~  1-78  and 
t(x,  y)  =  2.91,  it  follows  that  the  null  hypothesis  can  be  rejected.  The  mean  weight 
of  the  wife’s  packages  is  greater  than  the  mean  weight  of  the  baker’s  packages. 

In  R,  we  would  have  obtained  the  same  result  using  the  t .  test  command: 

x  <-  c(512, 530, 498, 540, 521, 528, 505, 523) 
y  <-  c(499, 500, 510, 495, 515, 503, 490, 511) 
t . test (x , y , alternat ive= ' greater  1 ) 


Welch  Two-Sample  t-test 


data:  x  and  y 

t  =  2.9058,  df  =  11.672,  p-value  =  0.006762 

alternative  hypothesis:  true  difference  in  means  is  greater 

than  0 . . . 

Note  that  we  have  to  specify  the  alternative  hypothesis  under  the  option  alter¬ 
native.  The  output  shows  us  the  test  statistic  (2.9058),  the  degrees  of  freedom 
(11.672),  the  alternative  hypothesis — but  not  the  decision  rule.  We  know  that  Ho 
is  rejected  if  t(x,  y)  >  tv- i_a,  so  the  decision  is  easy  in  this  case:  we  simply  have 
to  calculate  ti2;o.95  using  qt  (0 . 95 , 12)  in  R.  A  simpler  way  to  arrive  at  the  same 
decision  is  to  use  the  p -value.  We  know  that  Ho  is  rejected  if  p  <  a  which  is  the 
case  in  this  example.  It  is  also  worthwhile  mentioning  that  R  displays  the  hypothe¬ 
ses  slightly  differently  from  ours:  our  alternative  hypothesis  is  px  >  fiy  which  is 
identical  to  the  statement  px  —  /iy  >  0,  as  shown  by  R ,  see  also  Sect.  10.2.2. 

If  we  specify  two .  sided  as  an  alternative  (which  is  the  default),  a  confidence 
interval  for  the  mean  difference  is  also  part  of  the  output: 

t . test (x , y , alternat ive= ' two . sided ' ) 


95  7.  confidence  interval: 

4.151321  29.348679 

It  can  be  seen  that  the  confidence  interval  of  the  difference  does  not  cover  the  “0”. 
Therefore,  the  null  hypothesis  is  rejected.  This  is  the  duality  property  referred  to 
earlier  in  this  chapter:  the  test  decision  is  the  same,  no  matter  whether  one  evaluates 
(i)  the  confidence  interval,  (ii)  the  test  statistic,  or  (iii)  the  y>-value. 

Any  kind  of  t-test  can  be  calculated  with  the  t  .test  command:  for  example, 
the  two-sample  t-test  requires  to  specify  the  option  var .  equal =TRUE  while  the 
Welch  test  is  calculated  when  the  (default)  option  var .  equal=FALSE  is  set.  We  can 
also  conduct  a  one-sample  t- test.  Suppose  we  are  interested  in  whether  the  mean 
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weight  of  the  wife’s  packages  of  cookies  is  greater  than  500  g;  then,  we  could  test 
the  hypotheses: 


Ho  :  Hx  <  500  versus  H\  :  (ix  >  500. 
In  R,  we  simply  have  to  specify  po'. 


t .test (x,mu=500,alternative= ' greater  1 ) 


F 


which  gives  us 


One-Sample  t-test 


data:  x 


t  =  4.0031,  df  =  7,  p-value  =  0.002585 

alternative  hypothesis:  true  mean  is  greater  than  500 


1 0.3.4  Test  for  Comparing  the  Means  of  Two  Dependent  Samples 
(Paired  £-Test) 

Suppose  there  are  two  dependent  continuous  random  variables  X  and  Y  with  E(X)  = 
fix  and  E(F)  =  fiy  •  They  could  be  dependent  because  we  measure  the  same  variable 
twice  on  the  same  subjects  at  different  times.  Typically,  this  is  the  case  in  pre-post 
experiments,  for  example  when  we  measure  the  weight  of  a  person  before  starting  a 
special  diet  and  after  finishing  the  diet;  or  when  evaluating  household  expenditures 
on  electronic  appliances  in  two  consecutive  years.  We  then  say  that  the  samples 
are  paired ,  or  dependent.  Since  the  same  variable  is  measured  twice  on  the  same 
subject,  it  makes  sense  to  calculate  a  difference  between  the  two  respective  values. 
Let  D  =  X  —  Y  denote  the  random  variable  “difference  of  X  and  F”.  If  Hq  :  fix  =  dY 
is  true,  then  the  expected  difference  is  zero,  and  we  get  E (D)  =  pD  =  0.  This  means 
testing  Ho  \  px  =  dY  is  identical  to  testing  fiX  —  fiy  =  dD  =  0-  We  further  assume 
that  D  is  normally  distributed  if  Ho  '.fix  =  fiy  is  true  (or  equivalently  if  Ho :  fiD  =  0  is 
true),  i.e.  D  ~  N(0,  (j2d).  For  a  random  sample  (D  i,  D2,  . . . ,  Dn)  of  the  differences, 
the  test  statistic 


(10.9) 


is  ^-distributed  with  n  —  1  degrees  of  freedom.  The  sample  mean  is  D  =  X/=i  /  A  ^ 
and  the  sample  variance  is 


2  z;=i(a  -d)2 


n  —  1 
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which  is  an  estimator  of  cf2d  .  The  realized  test  statistic  is  thus 

d  r- 

t(d)  =  —Vn  (10.10) 

Sd 

where  d  =  X/7=i  di  / n  and  sd  =  X/=i  (di  —  d)2/n  —  1. 

The  two-sided  test  Ho :  po  =0  versus  H\:  po  /  0  and  the  one-sided  tests  Ho: 
/Id  <  0  versus  H\  \  po  >  0  or  Ho:  hd  >  0  versus  H\\  /id  <0  can  be  derived  as  in 
Sect.  10.3.2;  that  is,  the  test  decision  is  based  on  Table  10.2.  In  fact,  the  paired  t-test 
is  a  one-sample  t-test  on  the  differences  of  X  and  Y. 

Example  10.3.4  In  an  experiment,  n  =  10  students  have  to  solve  different  tasks 
before  and  after  drinking  a  cup  of  coffee.  Let  Y  and  X  denote  the  random  variables 
“number  of  points  before/after  drinking  a  cup  of  coffee”.  Assume  that  a  higher 
number  of  points  means  that  the  student  is  performing  better.  Since  the  test  is  repeated 
on  the  same  students,  we  have  a  paired  sample.  The  data  is  given  in  the  following 
table: 


i 

yi  (before) 

Xj  (after) 

K 

II 

^3 

C di  -  d)2 

1 

4 

5 

1 

0 

2 

3 

4 

1 

0 

3 

5 

6 

1 

0 

4 

6 

7 

1 

0 

5 

7 

8 

1 

0 

6 

6 

7 

1 

0 

7 

4 

5 

1 

0 

8 

7 

8 

1 

0 

9 

6 

5 

-1 

4 

10 

2 

5 

3 

4 

Total 

10 

8 

We  calculate 


d  =  1  and  s2d  =  -  =  0.9432, 

respectively.  For  the  realized  test  statistic  t(d),  using  a  =  0.05,  we  get 

t(d)  — - VTO  =  3.35  >  tg-o  95  —  1.83, 

0.943  ’ 

such  that  Ho:  fix  <  Ey  is  rejected  and  H\ :  px  >  dy  is  accepted.  We  can  conclude 
(for  this  example)  that  drinking  coffee  significantly  increased  the  problem-solving 
capacity  of  the  students. 
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In  R ,  we  would  have  obtained  the  same  results  using  the  t  .test  function  and 
specifying  the  option  paired=TRUE: 

yp  <-  c(4, 3, 5, 6,7, 6, 4, 7, 6, 2)  R 

xp  <-  c(5, 4, 6, 7, 8, 7, 5, 8, 5, 5) 
t . test (xp ,yp ,paired=TRUE) 


Paired  t-test 


data:  xp  and  yp 

t  =  3.3541,  df  =  9,  p-value  =  0.008468 

alternative  hypothesis:  true  difference  in  means  !=  0 

95  °/0  confidence  interval: 

0.325555  1.674445 
sample  estimates: 
mean  of  the  differences 

1 

We  can  make  the  test  decision  using  the  R  output  in  three  different  ways: 

(i)  We  compare  the  test  statistic  ( t  =  —3.35)  with  the  critical  value  (1.83,  obtained 
via  qt  (0.95,9)). 

(ii)  We  evaluate  whether  the  y>-value  (0.008468)  is  smaller  than  the  significance 
level  a  =  0.05. 

(iii)  We  evaluate  whether  the  confidence  interval  for  the  mean  difference  covers  “0” 
or  not. 


1 0.4  Parametric  Tests  for  Probabilities 
1 0.4.1  One-Sample  Binomial  Test  for  the  Probability p 

Test  construction  and  hypotheses. 

Let  X  be  a  Bernoulli  B(  1;  p)  random  variable  with  the  two  possible  outcomes  1 
and  0,  which  indicate  occurrence  and  non-occurrence  of  an  event  of  interest  A.  The 
probability  for  A  in  the  population  is  p.  From  the  sample  X  =  (X\ ,  X2,  . . . ,  Xn)  of 
independent  B(  1;  p) -distributed  random  variables,  we  calculate  the  mean  (relative 
frequency)  as  p  =  ^  X/=i  %i  which  is  an  unbiased  estimate  of  p.  The  following 
hypotheses  may  thus  be  of  interest: 


Case 

Null  hypothesis 

Alternative  hypothesis 

(a) 

II 

O 

P  +■  P0 

Two-sided  problem 

(b) 

IV 

^3 

O 

P  <  PO 

One-sided  problem 

(c) 

O 

VI 

P  >  P0 

One-sided  problem 
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In  the  following,  we  describe  two  possible  solutions,  one  exact  approach  and  an 
approximate  solution.  The  approximate  solution  is  based  on  the  approximation  of 
the  binomial  distribution  by  the  normal  distribution,  which  is  appropriate  if  n  is 
sufficiently  large  and  the  condition  np{\  —  p)  >9  holds  (i.e.  p  is  neither  too  small 
nor  too  large).  First,  we  present  the  approximate  solution  and  then  the  exact  one. 

Test  statistic  and  test  decisions. 


(a)  Approximate  binomial  test.  We  define  the  standardized  test  statistic  as 

p  —  po 

T(X)  =  F  F 


n. 


, _ v...  (10.11) 

V po(l  -  po) 

It  holds  approximately  that  T  (X)  ~  N( 0,  1),  given  that  the  conditions  that  (i) 
n  is  sufficiently  large  and  (ii)  np(l  —  p)  >  9  are  satisfied.  The  test  can  then 
be  conducted  along  the  lines  of  the  Gauss  test  in  Sect.  10.3.1;  that  is,  the  test 
decision  is  based  on  Table  10.1. 


Example  10.4.1  We  return  to  Example  10.1.1.  Let  us  assume  that  a  representative 
sample  of  size  n  =  2000  has  been  drawn  from  the  population  of  eligible  voters,  from 
which  700  (35  %)  have  voted  for  the  party  of  interest  P .  The  research  hypothesis 
(which  has  to  be  stated  as  H\)  is  that  more  than  30%  (i.e.  po  =  0.3)  of  the  eligible 
voters  cast  their  votes  for  party  P .  The  sample  is  in  favour  of  H\  because  p  =  35  %, 
but  to  draw  conclusions  for  the  proportion  of  voters  of  party  P  in  the  population, 
we  have  to  conduct  a  binomial  test.  Since  n  is  large  and  np(\  —  p)  =  2000  •  0.35  • 
0.65  =  455  ^  9,  the  assumptions  for  the  use  of  the  test  statistic  (10. 1 1)  are  satisfied. 
We  can  write  down  the  realized  test  statistic  as 


t(x)  = 


PO 


0.35-0.3 


n  = 


V2000  =  4.8795, 


V?o(l  -  po)  V0.3(l  -0.3) 

Using  a  =  0.05,  it  follows  that  T (X)  =  4.8795  >  z i-a  =  1-64,  and  thus,  the  null 
hypothesis  Ho  :  p  <  0.3  can  be  rejected.  Therefore,  H\  :  p  >  0.3  is  statistically 
significant;  that  is,  the  proportion  of  votes  for  party  P  is  greater  than  30  %. 


(b)  The  exact  binomial  test  can  be  constructed  using  the  knowledge  that  under  Ho , 
Y  =  Xi  (i.e.  the  number  of  successes)  follows  a  binomial  distribution.  In 
fact,  we  can  use  Y  directly  as  the  test  statistic: 

T(X)  =  Y  -  B(n,p0)  • 

The  observed  test  statistic  is  t(x)  =  JL  v/.  For  the  two-sided  case  (a),  the  two 
critical  numbers  q  and  cr  ( c/  <  cr)  which  define  the  critical  region,  have  to  be 
found  such  that 

Ph0(Y  <  ci)  <  ^  and  PHq(Y  >  cr)  < 

The  null  hypothesis  is  rejected  if  the  test  statistic,  i.e.  Y,  is  greater  than  or  equal 
to  cr  or  less  than  or  equal  to  c/ .  For  the  one-sided  case,  a  critical  number  c  has 
to  be  found  such  that 


Ph0(Y  <  c)  <  a 
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for  hypotheses  of  type  (b)  and 

Ph0(y  >c)<a 

for  hypotheses  of  type  (c).  If  Y  is  less  than  the  critical  value  c  (for  case  (b))  or 
greater  than  the  critical  value  (for  case  (c)),  the  null  hypothesis  is  rejected. 

Example  10.4.2  We  consider  again  Example  10.1.1  where  we  looked  at  the  popula¬ 
tion  of  eligible  voters,  from  which  700  (35  %)  have  voted  for  the  party  of  interest  P . 
The  observed  test  statistic  is  t(x)  =  JT  jc/  =  700  and  the  alternative  hypothesis  is 
H\  :  p  >  0.3,  as  in  case  (c).  There  are  at  least  two  ways  in  which  we  can  obtain  the 
results: 


(i)  Long  way :  We  can  calculate  the  test  statistic  and  compare  it  to  the  critical  region. 
To  get  the  critical  region,  we  search  c  such  that 

Pp=0.3(Y  >c)<  0.05  , 

which  equates  to 


Pp=03(Y  <c)  >0.95 


and  can  be  calculated  in  R  as: 

qbinom(p=0 . 95 ,  prob=0.3,  size=2000) 
[1]  634 


Since  Y  =  700  >  c  =  634  we  reject  the  null  hypothesis.  As  in  Example  10.4.1, 
we  conclude  that  there  is  enough  evidence  that  the  proportion  of  votes  for  party 
P  is  greater  than  30  %. 


(ii)  Short  way:  The  above  result  can  be  easily  obtained  in  R  using  the  binom. 
test  ()  command.  We  need  to  specify  the  number  of  “successes”  (here:  700), 
the  number  of  “failures”  (2000  —  700  =  1300),  and  the  alternative  hypothesis: 

binom. test (c (700, 1300) ,p=0 . 3 , alternative= ' greater ' ) 


data:  c(700,  1300) 

number  of  successes  =  700,  number  of  trials  =  2000, 
p-value  =  8.395e-07 

alternative  hypothesis:  true  probability  of  success 

is  greater  than  0.3 

95  °/0  confidence  interval: 

0.332378  1.000000 


probability  of  success 

0.35 
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Both  the  p -value  (which  is  smaller  than  a  =  0.05)  and  the  confidence  interval 
(for  which  we  do  not  show  the  calculation)  confirm  the  rejection  of  the  null 
hypothesis. 

Note  that 

binom. test (x=700 ,n=2000 ,p=0 . 3 ,  alternative= 1  greater ' ) 
returns  the  same  result. 


1 0.4.2  Two-Sample  Binomial  Test 

Test  construction  and  hypotheses. 

We  consider  now  the  case  of  two  independent  i.i.d.  samples  from  Bernoulli  distrib¬ 
utions  with  parameters  p\  and  p2 . 

X  =  (X\,  X2,  . . . ,  Xni),  Xi~B(Upi) 

Y  =  (YuY2,...,Yn2),  Yi~B(Up2). 

The  sums 

n\  U2 

X  =  YJXi  ~B(nupi),  Y  =  YjYi~B(n2;p2) 

i  =  1  i= 1 

follow  binomial  distributions.  One  of  the  following  hypotheses  may  be  of  interest: 


Case 

Null  hypothesis 

Alternative  hypothesis 

(a) 

Pi  =  P2 

Pl  /  P2 

Two-sided  problem 

(b) 

Pi  >  P2 

Pi  <  P2 

One-sided  problem 

(c) 

Pi  <  P2 

Pi  >  P2 

One-sided  problem 

Similar  to  the  one-sample  case,  both  exact  and  approximate  tests  exist.  Here,  we  only 
present  the  approximate  test.  The  exact  test  of  Fisher  is  presented  in  Appendix  C.5, 
p.  428.  Let  n\  and  112  denote  the  sample  sizes.  Then,  X/n\  and  Y /n 2  are  approxi¬ 
mately  normally  distributed: 


Their  difference  D 


D 


X 

n\ 


approx. 


approx. 


ri2 


approx. 

~  N 


( 


N 

N 


0,p(l 


P i(l  ~  Pi) 

n\ 

P2(  1  ~  pi) 
ri2 

■p)(—  + 

\n  1 
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is  normally  distributed  too  under  Ho  (given  p  =  p\  =  p2  holds).  Since  the  proba¬ 
bilities  p\  and  p2  are  identical  under  Ho,  we  can  pool  the  two  samples  and  estimate 

p  by 


X  +  Y 
n\  +  Ti2 


(10.12) 


Test  statistic  and  test  decision. 

The  test  statistic 

T(X,  Y) 


(10.13) 


follows  a  N(0,  1) -distribution  if  n\  and  ri2  are  sufficiently  large  and  p  is  not  near  the 
boundaries  0  and  1  (one  could  use,  for  example,  again  the  condition  np(  1  —  p)  >  9 
with  n  =  n\  +  ri2).  The  realized  test  statistic  can  be  calculated  using  the  observed 

/V 

difference  d  =  p\  —  p2-  The  test  can  be  conducted  for  the  one-sided  and  the  two- 
sided  case  as  the  Gauss  test  introduced  in  Sect.  10.3.1 ;  that  is,  the  decision  rules  from 
Table  10.1  can  be  applied. 


Example  10.4.3  Two  competing  lotteries  claim  that  every  fourth  lottery  ticket  wins. 
Suppose  we  want  to  test  whether  the  probabilities  of  winning  are  different  for  the 
two  lotteries,  i.e.  Ho  :  p \  =  P2  and  H\  :  p\  ^  p2.  We  have  the  following  data 


n  Winning  Not  winning 

Lottery  A 
Lottery  B 

63  14  49 

45  13  32 

We  can  estimate  the  probabilities  of  a  winning  ticket  for  each  lottery,  as  well  as  the 
respective  difference,  as 


14 


13 


Pa  = 


Pb  = 


63’  ^  45’ 

Under  Ho,  an  estimate  for  p  following  (10.12)  is 


d  =  pA-  pB 


1 

15 


14+13 
63  +  45 


27 

108 


0.25. 


The  test  statistic  can  be  calculated  as 


t(x,  y ) 


15 


0-25)  +  +  + 


-0.79. 


Ho  is  not  rejected  since  \t(x,  y)|  =  0.79  <  1.96  =  zi-o.05/2-  Thus,  there  is  no  statis¬ 
tical  evidence  for  different  winning  probabilities  for  the  two  lotteries.  These  hypothe¬ 
ses  can  be  tested  in  R  using  the  Test  of  Fisher,  see  Appendix  C. 5,  p.  428,  for  more 
details. 
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1 0.5  Tests  for  Scale  Parameters 

There  are  various  tests  available  to  test  hypotheses  about  scale  parameters.  Such 
tests  are  useful  when  one  is  interested  in  the  dispersion  of  a  variable,  for  example 
in  quality  control  where  the  variability  of  a  process  may  be  of  interest.  One-sample 
tests  of  hypotheses  for  the  variance  of  a  normal  distribution,  e.g.  hypotheses  such 
as  Ho  :  cr  =  <Jq,  can  be  tested  by  the  x2-test  for  the  variance,  see  Appendix  C.5, 
p.  430.  Two-sample  problems  can  be  addressed  by  the  F-test  (which  is  explained  in 
Appendix  C.5,  p.  431);  or  by  other  tests  such  as  the  Levene  test  or  Bartlett’s  test, 
which  are  also  available  in  R  (leveneTest  in  the  package  car,  bartlett  in  the 
base  distribution  of  R). 


1 0.6  Wilcoxon-Mann-Whitney  (WMW)  U-Test 

Test  construction  and  hypotheses. 

The  WMW  U- test  is  often  proposed  as  an  alternative  to  the  Ftest  because  it  also 
focuses  on  location  but  not  on  the  expected  value  fi.  It  is  a  nonpar ametric  test 
and  useful  in  situations  where  skewed  distributions  are  compared  with  each  other. 
We  consider  two  independent  random  samples  X  =  (X\,  X2, . . . ,  Xni)  and  Y  = 
(Y 1 ,  Y2,  ,  . . . ,  Yni)  from  two  populations  with  observed  values  (x\ ,  X2,  •  •  • ,  xni)  and 
(yi,  y2, . . . ,  yn2)’  respectively.  In  this  case,  the  null  hypothesis  Ho  considering  the 
location  can  be  formulated  as 

H0  :  P(X  >  Y)  =  P(Y  >  X)  =  1  . 

The  null  hypothesis  can  be  interpreted  in  the  following  way:  the  probability  that  a 
randomly  drawn  observation  from  the  first  population  has  a  value  v  that  is  greater 
(or  lower)  than  the  value  y  of  a  randomly  drawn  subject  from  the  second  population 
is  \  •  The  alternative  hypothesis  H\  is  then 

Hi  :  P(X  >Y)  ^  P(Y  >  X)  . 

This  means  we  are  comparing  the  entire  distribution  of  two  variables.  If  there  is  a 
location  shift  in  the  sense  that  one  distribution  is  shifted  left  (or  right)  compared 
with  the  other  distribution,  the  null  hypothesis  will  be  rejected  because  this  shift 
can  be  seen  as  part  of  the  alternative  hypothesis  P(X  >  Y)  7^  P(Y  >  X).  In  fact, 
under  some  assumptions,  the  hypothesis  can  even  be  interpreted  as  comparing  two 
medians,  and  this  is  what  is  often  done  in  practice. 

Observed  test  statistic. 

To  construct  the  test  statistic,  it  is  necessary  to  merge  (x\ ,  X2,  . . . ,  xni )  and  (yi ,  y2 , 

. . . ,  yn2)  into  one  sorted  sample,  usually  in  ascending  order,  while  keeping  the  infor¬ 
mation  which  value  belongs  to  which  sample.  For  now,  we  assume  that  all  values  of 
the  two  samples  are  distinct;  that  is,  no  ties  are  present.  Then,  each  observation  has 
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a  rank  between  1  and  (n\  +  n2).  Let  R\+  be  the  sum  of  ranks  of  the  v-sample  and 
let  R2+  be  the  sum  of  ranks  of  the  y-sample.  The  test  statistic  is  defined  as  U,  where 
U  is  the  minimum  of  the  two  values  U\ ,  U2,  U  =  min(I/i,  U2)  with 


n\(n\  +  1) 

U\  —  n\  ■  n2  ~\  —  - - --Ri+,  (10.14) 

noino  +  1) 

U2  =  ni  •  n2  +  - -  -  R2+.  (10.15) 

Test  decision. 

Hq  is  rejected  if  U  <  unun2-a.  Here,  un un2-a  is  the  critical  value  derived  from  the 
distribution  of  U  under  the  null  hypothesis.  The  exact  (complex)  distribution  can, 
for  example,  be  derived  computationally  (in  R).  We  are  presenting  an  approximate 
solution  together  with  its  implementation  in  R. 

Since  U\  +  U2  =  n\  •  n2,  it  is  sufficient  to  compute  only  7?;+  and  U  =  min {£//, 
n\n2  —  Ui }  (i  =  1  or  i  =  2  are  chosen  such  that  is  calculated  for  the  sample 
with  the  lower  sample  size).  For  n\,n2  >  8,  one  can  use  the  approximation 


T (X,  Y)  = 


U  - 


n\-ri2 


n\  •  n2  •  (n\  +  n2  +  1) 


approx. 

~  N(0,  1) 


12 


(10.16) 


as  the  test  statistic.  For  two-sided  hypotheses,  Ho  is  rejected  if  \t(x,  y)|  >  z i-a/2\ 
for  one-sided  hypotheses  Ho  is  rejected  if  \t(x,  y)|  >  zi-a-  In  the  case  of  ties,  the 
denominator  of  the  test  statistic  in  (10.16)  can  be  modified  as 


T (X,  Y)  = 


U  - 


n\-U2 


N 


approx. 

~  N(0,  1), 


n\  •  n2 

- 1 

3 

1 

1— * * 

1 - 

n  —  n 


12 


-S 

7=1 


G  t3  -t- 
lj  TJ 


12 


where  G  is  the  number  of  different  (groups  of)  ties  and  tj  denotes  the  number  of  tied 
ranks  in  tie  group  j . 


Example  10.6.1  In  a  study,  the  reaction  times  (in  seconds)  to  a  stimulus  were  mea¬ 
sured  for  two  groups.  One  group  drank  a  strong  coffee  before  the  stimulus  and  the 
other  group  drank  only  the  same  amount  of  water.  There  were  9  study  participants 
in  the  coffee  group  and  10  participants  in  the  water  group.  The  following  reaction 
times  were  recorded: 


Reaction  time 

1 

2 

3 

4 

5 

6 

7 

8 

9 

10 

Coffee  group  (C) 

3.7 

4.9 

5.2 

6.3 

7.4 

4.4 

5.3 

1.7 

2.9 

Water  group  (W) 

4.5 

5.1 

6.2 

7.3 

8.7 

4.2 

3.3 

8.9 

2.6 

4.8 

We  test  with  the  U -test  whether  there  is  a  location  difference  between  the  two  groups. 
First,  the  ranks  of  the  combined  sample  are  calculated  as: 
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123456789  10 

Total 

Value  (C) 
Rank  (C) 

3.7  4.9  5.2  6.3  7.4  4.4  5.3  1.7  2.9 

5  10  12  15  17  7  13  1  3 

83 

Value  (W) 
Rank  (W) 

4.5  5.1  6.2  7.3  8.7  4.2  3.3  8.9  2.6  4.8 

8  11  14  16  18  6  4  19  2  9 

107 

With  Rc+  -  83  and  Rw+  =  107,  we  get 

n\(n\  +  1)  9-10 

Ui  =  nx  •  n2  +  - -  -  Rc+  =  9  •  10  +  — -  83  =  52, 

^2(^2  +  1)  10-11 

U2  =  ni  •  n2  +  - -  -  Rw+  =  9  •  10  +  — -  107  =  38. 

With  n\,n2  >  8  and  U  =  U2  =  38, 


f(x,  y)  = 


{/  - 


n\-ri2 


38  - 


9-10 


n\-n2-  (n\  +^2  +  1) 


9-  10  -  (9  +  10+  1) 


-0.572. 


12 


12 


Since  \t(x,  y)|  =  0.572  <  zi— qj/2  =  1-96,  the  null  hypothesis  cannot  be  rejected; 
that  is,  there  is  no  statistical  evidence  that  the  two  groups  have  different  reaction 
times. 

In  R ,  one  can  use  the  wilcox.test  command  to  obtain  the  results: 


coffee  <-  c(3.7,  4.9,  5.2,  6.3,  ...,  2.9) 
water  <-  c(4.5,  5.1,  6.2,  ...,  4.8) 
wilcox .test (coffee ,  water) 


The  output  is 


Wilcoxon  rank  sum  test 


data:  cof fee . sample  and  water. sample 

W  =  38,  p-value  =  0.6038 

alternative  hypothesis:  true  location  shift  is  not  equal  to  0 


We  can  see  that  the  null  hypothesis  is  not  rejected  because  p  =  0.6038  >  a  = 
0.05.  The  displayed  test  statistic  is  W  which  equates  to  our  statistic  U2.  The  alternative 
hypothesis  in  R  is  framed  as  location  shift,  an  interpretation  which  has  already  been 
given  earlier  in  the  chapter.  Note  that  the  test  also  suggests  that  the  medians  of  the 
two  samples  are  not  statistically  different. 


10.7  x2-Goodness-of-FitTest 
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Test  construction. 

The  x2-goodness-of-fit  test  is  one  of  the  most  popular  tests  for  testing  the  goodness 
of  fit  of  the  observed  data  to  a  distribution.  The  construction  principle  is  very  general 
and  can  be  used  for  variables  of  any  scale.  The  test  statistic  is  derived  such  that  the 
observed  absolute  frequencies  are  compared  with  the  expected  absolute  frequencies 
under  the  null  hypothesis  Hq. 

Example  10.7.1  Consider  an  experiment  where  a  die  is  rolled  n  =  60  times.  Under 
the  null  hypothesis  Ho,  we  assume  that  the  die  is  fair,  i.e.  pi  =  i  =  1,  2,  . . . ,  6, 
where  pi  =  P(X  =  i).  We  could  have  also  said  that  Ho  is  the  hypothesis  that  the 
rolls  are  following  a  discrete  uniform  distribution.  Thus,  the  expected  absolute  fre¬ 
quencies  under  Ho  are  npi  =  60  •  ^  =  10,  while  the  observed  frequencies  in  the 
sample  are  Ni ,  i  =  1,  2,  . . . ,  6.  The  Ni  generally  deviate  from  npi .  The  x2 -statistic 
is  based  on  the  squared  differences,  Xf=i  W  —  nPi )2>  and  becomes  large  as  the 
differences  between  the  observed  and  the  expected  frequencies  become  larger.  The 
X2-test  statistic  is  a  modification  of  this  sum  by  scaling  each  squared  difference  by 
the  expected  frequencies,  npi ,  and  is  explained  below. 

With  a  nominal  variable,  we  can  proceed  as  in  Example  10.7.1.  If  the  scale  of  the 
variable  is  ordinal  or  continuous,  the  number  of  different  values  can  be  large.  Note 
that  in  the  most  extreme  case,  we  can  have  as  many  different  values  as  observations 
(, n ),  leading  to  Ni  =  1  for  all  i  =  1,2,  ...  ,n.  Then,  it  is  necessary  to  group  the  data 
into  k  intervals  before  applying  the  x2-test.  The  reason  is  that  the  general  theory  of 
the  x2-test  assumes  that  the  number  k  (which  was  6  in  Example  10.7.1  above)  is 
fixed  and  does  not  grow  with  the  number  of  observations  n;  that  is,  the  theory  says 
that  the  x2-test  only  works  properly  if  k  is  fixed  and  n  is  large.  For  this  reason,  we 
group  the  sample  X  =  (X\ ,  X2,  . . . ,  Xn)  into  k  classes  as  shown  in  Sect.  2.1. 


Class 

1  2  •  •  •  k 

Total 

Number  of  observations 

n\  n2  ■■■  nk 

n 

The  choice  of  the  class  intervals  is  somewhat  arbitrary.  As  a  rule  of  thumb  npi  >5 
should  hold  for  most  class  intervals.  The  general  hypotheses  can  be  formulated  in 
the  form  of  distribution  functions: 


Ho  :  F(x)  =  Fq(x)  versus  H\  :  F(x)  7^  Fq(x). 


Test  statistic. 

The  test  statistic  is  defined  as 


r(X)  =  *(*)  =  x2 


npi) 


2 


(10.17) 
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Here, 

•  Ni  (i  =  1,  2, . . . ,  k)  are  the  absolute  frequencies  of  observations  of  the  sample  X 
in  class  i,  Ni  is  a  random  variable  with  realization  ri(  in  the  observed  sample; 

•  pi  (i  =  1,  2,  ...,&)  are  calculated  from  the  distribution  under  Ho ,  Fo(x),  and  are 
the  (hypothetical)  probabilities  that  an  observation  of  X  falls  in  class  i ; 

•  npi  are  the  expected  absolute  frequencies  in  class  i  under  Ho. 

Test  decision. 

For  a  significance  level  a,  Ho  is  rejected  if  t(x)  is  greater  than  the  (1  —  ct)-quantile 
of  the  x2 -distribution  with  k  —  1  —  r  degrees  of  freedom,  i.e.  if 

t(x)  =  X2  >  Ck-l—r,l—a- 

Note  that  r  is  the  number  of  parameters  of  Fo(x),  if  these  parameters  are  estimated 
from  the  sample.  The  x2-test  statistic  is  only  asymptotically  x2 -distributed  under 
Ho. 

Example  10.7.2  Let  Fo(x)  be  the  distribution  function  of  the  test  distribution.  If 
one  specifies  a  normal  distribution  such  as  Fo(v)  =  N (3,  10),  or  a  discrete  uniform 
distribution  with  pi  =0.25  (i  =  1,  2,  3,  4),  then  r  =  0,  since  no  parameters  have 
to  be  estimated  from  the  data.  Otherwise,  if  we  simply  want  to  test  whether  the 
data  is  generated  from  a  normal  distribution  N(p,  a2)  or  the  data  follows  a  normal 
distribution  N(p,  a2),  then  p  and  a2  may  be  estimated  from  the  sample  by  v  and  s2 . 
Then,  r  =  2  and  the  number  of  degrees  of  freedom  is  reduced. 

Example  10.7.3  Gregor  Mendel  (1822-1884)  conducted  crossing  experiments  with 
pea  plants  of  different  shape  and  colour.  Let  us  look  at  the  outcome  of  a  pea  crossing 
experiment  with  the  following  results: 


Crossing  result 

Round 

Round 

Edged 

Edged 

Yellow 

Green 

Yellow 

Green 

Observations 

315 

108 

101 

32 

Mendel  had  the  hypothesis  that  the  four  different  types  occur  in  proportions  of 
9:3:3: 1,  that  is 


Pi 


The  hypotheses  are 


1 

16' 


H0  :  P(X  =  i)  =  pi  versus  Hi  :  P(X  =  i)  ^  pt,  i  =  1,2,  3,4. 


With  n  =  556  observations,  the  test  statistic  can  be  calculated  from  the  following 
observed  and  expected  frequencies: 
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i 

Ni 

Pi 

npi 

1 

315 

9 

16 

312.75 

2 

108 

3 

16 

104.25 

3 

101 

3 

16 

104.25 

4 

32 

1 

16 

34.75 

The  x2-test  statistic  is  calculated  as 


(315  —  312.75)2 
312.75 


+  •••  + 


(32  -  34. 75)2 
34.75 


0.47. 


Since  x2  =  0-47  <  7.815  =  X095  3  =  <4).95,3>  the  null  hypothesis  is  not  rejected. 
Statistically,  there  is  no  evidence  that  Mendel  was  wrong  with  his  9:3:3: 1  assumption. 
In  R ,  the  test  can  be  conducted  by  applying  the  chisq.test  command: 


chisq. test (c (315 ,  108,  101,  32), 
p=c (9/16, 3/16, 3/16, 1/16)) 
qchisq(df =3 ,  p=0.95) 


which  leads  to  the  following  output 


Chi-squared  test  for  given  probabilities 

data:  c(315,  108,  101,  32) 

X-squared  =  0.47,  df  =  3,  p-value  =  0.9254 


and  the  critical  value  is 

[1]  7.814728 

Remark  10.7.1  In  this  example,  the  data  was  already  summarized  in  a  frequency 
table.  For  raw  data,  the  table  command  can  be  used  to  preprocess  the  data,  i.e.  we 
can  use  chisq.test (table (varl , var2) ) . 

Another  popular  goodness-of-fit  test  is  the  test  of  Kolmogorov-Smirnov.  There 
are  two  different  versions  of  this  test,  one  for  the  one- sample  scenario  and  one  for  the 
two- sample  scenario.  The  null  hypothesis  for  the  latter  is  that  the  two  independent 
samples  come  from  the  same  distribution.  In  R ,  the  command  ks.testO  can  be 
used  to  perform  Kolmogorov-Smirnov  tests. 
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1 0.8  %2-lndependence  Test  and  Other  %2 -Tests 

In  Chap.  4,  we  introduced  different  methods  to  describe  the  association  between 
two  variables.  Several  association  measures  are  possibly  suitable  if  the  variables  are 
categorical,  for  example  Cramer’s  V,  Goodman’s  and  Kruskal’s  7,  Spearman’s  rank 
correlation  coefficient,  and  the  odds  ratio.  If  we  are  not  interested  in  the  strength  of 
association  but  rather  in  finding  out  whether  there  is  an  association  at  all,  one  can 
use  the  x2 -independence  test. 

Test  construction. 

In  the  following  we  assume  that  we  observe  a  sample  from  a  bivariate  discrete 
distribution  of  two  variables  X  and  Y  which  can  be  summarized  in  a  contingency 
table  with  absolute  frequencies  riij,(i  =  1,2,...,/;  j  =  1,2...,/): 


Y 

1  2  ■■■  J 

X  1  n\i  n  12  •  •  •  nu 

771  + 

2  7121  ^22  •  •  •  ^27 

n2+ 

I  nn  nn  •  •  •  nu 

ni+ 

n+i  77+2  •  •  •  n+j 

77 

Remember  that 

ni+  is  the  i  th  row  sum, 

n+j  is  the  j  th  column  sum,  and 

n  is  the  total  number  of  observations. 


The  hypotheses  are  Hq:  X  and  Y  are  independent  versus  H\  :  X  and  Y  are  not 
independent.  If  X  and  Y  are  independent,  then  the  expected  frequencies  ntij  are 


A  /V 


nj+n+j 

n 


Test  statistic. 

Pearson’s  x2-test  statistic  was  introduced  in  Chap.  4,  Eq.  (4.6).  It  is 


(10.18) 


T(X,  Y)  =  x2 


where  ntij  =  riTTjj  =  mri+ir+j  (expected  absolute  cell  frequencies  under  Ho). 
Strictly  speaking,  ntij  are  the  true,  unknown  expected  frequencies  under  Ho  and 
are  estimated  by  ntij  =  nni+n+j,  such  that  the  realized  test  statistic  equates  to 


t(x,  y)  =  x2 


i  =  1  7  =  1 


(10.19) 
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Test  decision. 

The  number  of  degrees  of  freedom  under  Ho  is  (/  —  1)(/  —  1),  where  I  —  1  are  the 
parameters  which  have  to  be  estimated  for  the  marginal  distribution  of  X ,  and  J  —  l 
are  the  number  of  parameters  for  the  marginal  distribution  of  Y .  The  test  decision  is: 

Reject  H0,  if t(x,  y)  =  X  >  c(/-i)(/-i);i-a- 

Note  that  the  alternative  hypothesis  H \  is  very  general.  If  Ho  is  rejected,  nothing  can 
be  said  about  the  structure  of  the  dependence  of  X  and  Y  from  the  x2 -value  itself. 

Example  10.8.1  Consider  the  following  contingency  table.  Here,  X  describes  the 
educational  level  (1:  primary,  2:  secondary,  3:  tertiary)  and  Y  the  preference  for  a 
specific  political  party  (1 :  Party  A,  2:  Party  B,  3:  Party  C).  Our  null  hypothesis  is  that 
the  two  variables  are  independent,  and  we  want  to  show  the  alternative  hypothesis 
which  says  that  there  is  a  relationship  between  them. 


Y 

1  2  3 

Total 

X  1 

100  200  300 

600 

2 

100  100  100 

300 

3 

80  10  10 

100 

Total  280  310  410 

1000 

For  the  (estimated)  expected  frequencies  rhij  =  ,  we  get 


Y 

1 

2 

3 

X  1 

168 

186 

246 

2 

84 

93 

123 

3 

28 

31 

41 

For  example:  rh\\  =  600  •  280/1000  =  168.  The  test  statistic  is 


I  J  /  ^  \2 


t(x,  y )  = 


(tijj  -  mu) 


'j 


i  =  1  7  =  1 
(100  -  168)2 
168 


Mij 


+  •••  + 


(10-41)' 

41 


182.54. 


Since  X4  095  =  9.49  <  t(x,  y )  =  182.54,  Ho  is  rejected. 
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In  R ,  either  the  summarized  data  (as  shown  below)  can  be  used  to  calcu¬ 
late  the  test  statistic  or  the  raw  data  (summarized  in  a  contingency  table  via 
table  (varl ,  var2)): 

ct  <-  matrix (nrow=3,ncol=3,byrow=T, 
data=c (100, 200, 300, 100, 100, 100, 80, 10, 10)) 
chisq. test (ct) 

qchisq(df =(3-1) * (3-1) ,  p=0.95) 

The  output  is 


Pearson’s  Chi-squared  test 


data:  contingency . table 

X-squared  =  182.5428,  df  =  4,  p-value  <  2.2e-16 
with  the  critical  value 

[1]  9.487729 

which  confirms  our  earlier  manual  calculations.  The  p-value  is  smaller  than  a  =  0.05 
which  further  confirms  that  the  null  hypothesis  has  to  be  rejected. 

For  a  binary  outcome,  the  x2-test  of  independence  can  be  formulated  as  a  test  for 
the  null  hypothesis  that  the  proportions  of  the  binary  variable  are  equal  in  several 
(>2)  groups,  i.e.  for  a  K  x  2  (or  2  x  K)  table.  This  test  is  called  the  %2-test  of 
homogeneity. 

Example  10.8.2  Consider  two  variables  X  and  Y,  where  X  is  describing  the  rating 
of  a  coffee  brand  with  the  categories  “bad  taste”  and  “good  taste”  and  Y  denotes 
three  age  subgroups,  e.g.  “18-25”,  “25-35”,  and  “35-45”.  The  observed  data  is 


18-25 

Y 

25-35 

35-45 

Total 

X  Bad 

10 

30 

65 

105 

Good 

90 

70 

35 

195 

Total 

100 

100 

100 

300 

Assume  Ho  is  the  hypothesis  that  the  probabilities  P(X  =  ‘good’|y  =  ‘18-25’), 
P(X  =  ‘good’|T  =  ‘25-35’),  and  P(X  =  ‘good’|T  =  ‘35-45’)  are  all  equal.  Then, 
we  can  use  the  function  either  prop. test  or  chisq. test  in  R  to  test  this 
hypothesis: 


241 


1 0.8  x2“lndependence Test  and  Other  x2 -Tests 


prop . test (x=rbind(c (10 , 30 , 65) ,  c(90,70,35)  )) 
chisq. test (x=rbind(c (10 , 30,65) ,  c(90,70,35)  )) 

This  produces  the  following  outputs: 

3-sample  test  for  equality  of  proportions 

data:  cbind(c(10,  30,  65),  c(90,  70,  35)) 

X-squared  =  68.1319,  df  =  2,  p-value  =  1.605e-15 
alternative  hypothesis:  two. sided 
sample  estimates: 
prop  1  prop  2  prop  3 
0.10  0.30  0.65 


and 


Pearson’s  Chi-squared  test 

data:  cbind(c(10,  30,  65),  c(90,  70,  35)) 

X-squared  =  68.1319,  df  =  2,  p-value  =  1.605e-15 

The  results  (test  statistic,  p-value)  are  identical  and  Ho  is  rejected.  Note  that 
prop  .test  strictly  expects  a  K  x  2  table  (i.e.  exactly  2  columns). 

Remark  10.8.1  For  2  x  2-tables  with  small  sample  sizes  and  therefore  small  cell 
frequencies,  it  is  recommended  to  use  the  exact  test  of  Fisher  as  described  in  Appen¬ 
dix  C.5. 

Remark  10.8.2  The  test  described  in  Example  10.8.2  is  a  special  case  (since  one 
variable  is  binary)  of  the  general  x2-test  of  homogeneity.  The  x2-test  of  homogeneity 
is  valid  for  any  K  x  C  table,  where  K  is  the  number  of  subgroups  of  a  variable  Y 
and  C  is  the  number  of  values  of  the  outcome  X  of  interest.  The  null  hypothesis  Ho 
assumes  that  the  conditional  distributions  of  X  given  Y  are  identical  in  all  subgroups, 
i.e. 

P(X  =  xc\ Y  =  yk)  =  P(X  =  xc\ Y  =  yv) 

forall  c  =  1,  2, . . . ,  C;  k,  k'  =  1,2 ,  . . . ,  K,k  ^  k'.  Again,  the  usual  x2-test  statistic 
can  be  used. 
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1 0.9  Key  Points  and  Further  Issues 


Note: 


/  A  graphical  summary  on  when  to  use  the  tests  introduced  in  this  chapter 
is  given  in  Appendices  D.2  and  D.3. 

/  To  arrive  at  a  test  decision,  i.e.  accept  Ho  or  reject  it,  it  does  not  matter 
whether  one  compares  the  test  statistic  to  the  critical  region,  one  uses  the 
p-value  obtained  from  statistical  software,  or  one  evaluates  the  appropri¬ 
ate  confidence  interval.  However,  it  is  important  not  to  misinterpret  the 
p-\ alue  (see  Sect.  10.2.6)  and  to  choose  the  correct  confidence  interval. 

/  There  is  a  difference  between  relevance  and  significance.  A  test  might  be 
significant,  but  the  point  estimate  of  the  quantity  of  interest  may  not  be 
relevant  from  a  substantive  point  of  view.  Similarly,  a  test  might  not  be 
significant,  but  the  point  and  interval  estimates  may  still  yield  relevant 
conclusions. 

/  The  test  statistic  of  the  t-test  (one-sample,  two-sample,  paired)  is  asymp¬ 
totically  normally  distributed.  This  means  that  for  relatively  large  n  (as 
a  rule  of  thumb  >30  per  group)  the  sample  does  not  need  to  come  from  a 
normal  distribution.  However,  the  application  of  the  t-test  makes  sense 
only  when  the  expectation  p  can  be  interpreted  meaningfully;  this  may 
not  be  the  case  for  skewed  distributions  or  distributions  with  outliers. 


10.10  Exercises 

Exercise  10.1  Two  people,  A  and  B,  are  suspects  for  having  committed  a  crime 
together.  Both  of  them  are  interrogated  in  separate  rooms.  The  jail  sentence  depends 
on  who  confesses  to  have  committed  the  crime,  and  who  does  not: 


B  does  not  confess  B  does  confess 

A  does  not  confess 

A  does  confess 

Each  serves  1  year  A:  3  years;  B:  goes  free 

A:  goes  free;  B:  3  years  Each  serves  2 years 

A  has  two  hypotheses: 

Ho  :  B  does  not  confess  versus  H\  :  B  does  confess. 

Given  the  possible  sentences  he  decides  to  not  confess  if  Ho  is  true  and  to  confess 
otherwise.  Explain  the  concepts  of  type  I  error  and  type  II  error  for  this  situation. 
Comment  on  the  consequences  if  these  errors  are  made. 
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Exercise  10.2  A  producer  of  chocolate  bars  hypothesizes  that  his  production  does 
not  adhere  to  the  weight  standard  of  100  g.  As  a  measure  of  quality  control,  he  weighs 
15  bars  and  obtains  the  following  results  in  grams: 

96.40,  97.64,  98.48,  97.67,  100.11,  95.29,  99.80,  98.80,  100.53,  99.41,  97.64, 

101.11,93.43,  96.99,  97.92 

It  is  assumed  that  the  production  process  is  standardized  in  the  sense  that  the  variation 
is  controlled  to  be  a  =  2. 

(a)  What  are  the  hypotheses  regarding  the  expected  weight  fi  for  a  two-sided  test? 

(b)  Which  test  should  be  used  to  test  these  hypotheses? 

(c)  Conduct  the  test  that  was  suggested  to  be  used  in  (b).  Use  a  =  0.05. 

(d)  The  producer  wants  to  show  that  the  expected  weight  is  smaller  than  100  g.  What 
are  the  appropriate  hypotheses  to  use? 

(e)  Conduct  the  test  for  the  hypothesis  in  (d).  Again  use  a  =  0.05. 

Exercise  10.3  Christian  decides  to  purchase  the  new  CD  by  Bruce  Springsteen.  His 
first  thought  is  to  buy  it  online,  via  an  online  auction.  He  discovers  that  he  can  also 
buy  the  CD  immediately,  without  bidding  at  an  auction,  from  the  same  online  store. 
He  also  looks  at  the  price  at  an  internet  book  store  which  was  recommended  to  him 
by  a  friend.  He  notes  down  the  following  prices  (in  €): 

Internet  book  store  16.95 

Online  store,  no  auction  18.19,  16.98,  19.97,  16.98,  18.19,  15.99,  13.79,  15.90, 
15.90,  15.90,  15.90,  15.90,  19.97,  17.72 

Online  store,  auction  10.50,  12.00,  9.54,  10.55,  11.99,  9.30,  10.59,  10.50,  10.01, 
11.89,  11.03,9.52,  15.49,  11.02 

(a)  Calculate  and  interpret  the  arithmetic  mean,  variance,  standard  deviation,  and 
coefficient  of  variation  for  the  online  store,  both  for  the  auction  and  non-auction 
offers. 

(b)  Test  the  hypothesis  that  the  mean  price  at  the  online  store  (no  auction)  is  unequal 
to  =€16.95  ( a  =  0.05). 

(c)  Calculate  a  confidence  interval  for  the  mean  price  at  the  online  store  (no  auction) 
and  interpret  your  findings  in  the  light  of  the  hypothesis  in  (b). 

(d)  Test  the  hypothesis  that  the  mean  price  at  the  online  store  (auction)  is  less  than 
€16.95  (a  =  0.05). 

(e)  Test  the  hypothesis  that  the  mean  non-auction  price  is  higher  than  the  mean 
auction  price.  Assume  that  (i)  the  variances  are  equal  in  both  samples  and  (ii) 
the  variances  are  unequal  (a  =  0.05). 

(f)  Test  the  hypothesis  that  the  variance  of  the  non-auction  price  is  unequal  to  the 
variance  of  the  auction  price  ( a  =  0.05). 
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(g)  Use  the  U-test  to  compare  the  location  of  the  auction  and  non-auction  prices. 
Compare  the  results  with  those  of  (e). 

(h)  Calculate  the  results  of  (a)-(g)  with  R. 

Exercise  10.4  Ten  of  Leonard’s  best  friends  try  a  new  diet:  the  “Banting”  diet.  Each 
of  them  weighs  him/herself  before  and  after  the  diet.  The  data  is  as  follows: 


Person  (/) 

1 

2 

3 

4 

5 

6 

7 

8 

9 

10 

Before  diet  (xj) 

80 

95 

70 

82 

71 

70 

120 

13 

nn 

90 

After  diet  (y; ) 

78 

94 

69 

83 

65 

69 

118 

■9 

B9 

88 

Choose  a  test  and  a  confidence  interval  to  test  whether  there  is  a  difference  between 
the  mean  weight  before  and  after  the  diet  ( a  =  0.05). 

Exercise  10.5  A  company  producing  clothing  often  finds  deficient  T-shirts  among 
its  production. 

(a)  The  company’s  controlling  department  decides  that  the  production  is  no  longer 
profitable  when  there  are  more  than  10  %  deficient  shirts.  A  sample  of  230  shirts 
yields  30  shirts  which  contain  deficiencies.  Use  the  approximate  binomial  test 
to  decide  whether  the  T-shirt  production  is  profitable  or  not  ( a  =  0.05). 

(b)  Test  the  same  hypothesis  as  in  (a)  using  the  exact  binomial  test.  You  can  use  R 
to  determine  the  quantiles  needed  for  the  calculation. 

(c)  The  company  is  offered  a  new  cutting  machine.  To  test  whether  the  change  of 
machine  helps  to  improve  the  production  quality,  115  sample  shirts  are  evaluated, 
7  of  which  have  deficiencies.  Use  the  two-sample  binomial  test  to  decide  whether 
the  new  machine  yields  improvement  or  not  ( a  =  0.05). 

(d)  Test  the  same  hypothesis  as  in  (c)  using  the  test  of  Fisher  in  R. 

Exercise  10.6  Two  friends  play  a  computer  game  and  each  of  them  repeats  the  same 
level  10  times.  The  scores  obtained  are: 


1 

2 

3 

4 

5 

6 

7 

8 

9 

10 

Player  1 

91 

101 

112 

99 

108 

88 

99 

105 

111 

104 

Player  2 

261 

47 

40 

29 

64 

6 

87 

47 

98 

351 

(a)  Player  2  insists  that  he  is  the  better  player  and  suggests  to  compare  their  mean 
performance.  Use  an  appropriate  test  (a  =  0.05)  to  test  this  hypothesis. 

(b)  Player  1  insists  that  he  is  the  better  player.  He  proposes  to  not  focus  on  the  mean 
and  to  use  the  U -test  for  comparison.  What  are  the  advantages  and  disadvantages 
of  using  this  test  compared  with  (a)?  What  are  the  results  (a  =  0.05)7 
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Exercise  10.7  Otto  loves  gummy  bears  and  buys  10  packets  at  a  factory  store.  He 
opens  all  packets  and  sorts  them  by  their  colour.  He  counts  222  white  gummy  bears, 
279  red  gummy  bears,  251  orange  gummy  bears,  232  yellow  gummy  bears,  and 
266  green  ones.  He  is  disappointed  since  white  (pineapple  flavour)  is  his  favourite 
flavour.  He  hypothesizes  that  the  producer  of  the  bears  does  not  uniformly  distribute 
the  bears  into  the  packets.  Choose  an  appropriate  test  to  find  out  whether  Otto’s 
speculation  could  be  true. 


Exercise  10.8  We  consider  Exercise  4.4  where  we  evaluated  which  of  the  passengers 
from  the  Titanic  were  rescued.  The  data  was  summarized  as  follows: 


1.  Class  2.  Class  3.  Class  Staff 

Total 

Rescued 

Not  rescued 

202  125  180  211 

135  160  541  674 

718 

1510 

(a)  The  hypothesis  derived  from  the  descriptive  analysis  was  that  travel  class  and 
rescue  status  are  not  independent.  Test  this  hypothesis. 

(b)  Interpret  the  following  R  output: 

4-sample  test  for  equality  of  proportions 
data:  titanic 

X-squared  =  182.06,  df  =  3,  p-value  <  2.2e-16 
alternative  hypothesis:  two. sided 
sample  estimates: 

prop  1  prop  2  prop  3  prop  4 
0.5994065  0.4385965  0.2496533  0.2384181 

(c)  Summarize  the  data  in  a  2x2  table:  passengers  from  the  first  and  second  class 
should  be  grouped  together,  and  third  class  passengers  and  staff  should  be 
grouped  together  as  well.  Is  the  probability  of  being  rescued  higher  in  the  first 
and  second  class?  Provide  an  answer  using  the  following  three  tests:  exact  test 
of  Fisher,  x2 -independence  test,  and  x2 -homogeneity  test.  You  can  use  R  to 
conduct  the  test  of  Fisher. 

Exercise  10.9  We  are  interested  in  understanding  how  well  the  t- test  can  detect 
differences  with  respect  to  the  mean.  We  use  R  to  draw  3  samples  each  of  20  obser¬ 
vations  from  three  different  normal  distributions:  X  ~  N(5,  22),  Y\  ~  N(4,  22),  and 
Y2  ~  N (3.5,  22).  The  summary  statistics  of  this  experiment  are  as  follows: 
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•  X  =  4.97,  s2  =  2.94, 

•  yi  =  4.55,  s2  —  2.46, 

•  yi  —  3.27,  s22  =  3.44. 

(a)  Use  the  t- test  to  compare  the  means  of  X  and  Y\. 

(b)  Use  the  t- test  to  compare  the  means  of  X  and  Y2. 

(c)  Interpret  the  results  from  (a)  and  (b). 

Exercise  10.10  Access  the  theatre  data  described  in  Appendix  A.4.  The  data  sum¬ 
marizes  a  survey  conducted  on  visitors  of  a  local  Swiss  theatre  in  terms  of  age,  sex, 
annual  income,  general  expenditure  on  cultural  activities,  expenditure  on  theatre  vis¬ 
its,  and  the  estimated  expenditure  on  theatre  visits  in  the  year  before  the  survey  was 
done. 

(a)  Compare  the  mean  expenditure  on  cultural  activities  for  men  and  women  using 
the  Welch  test  ( a  =  0.05). 

(b)  Would  the  conclusions  change  if  the  two- sample  t- test  or  the  U -test  were  used 
for  comparison? 

(c)  Test  the  hypothesis  that  women  spend  on  average  more  money  on  theatre  visits 
than  men  ( a  =  0.05). 

(d)  Compare  the  mean  expenditure  on  theatre  visits  in  the  year  of  the  survey  and  the 
preceding  year  ( a  =  0.05). 

Exercise  10.11  Use  R  to  read  in  and  analyse  the  pizza  data  described  in  Appen¬ 
dix  A.4  (assume  a  =  0.05). 

(a)  The  manager’s  aim  is  to  deliver  pizzas  in  less  than  30  min  and  with  a  temperature 
of  greater  than  65  °C.  Use  an  appropriate  test  to  evaluate  whether  these  aims  have 
been  reached  on  average. 

(b)  If  it  takes  longer  than  40  min  to  deliver  the  pizza,  then  the  customers  are  promised 
a  free  bottle  of  wine.  This  offer  is  only  profitable  if  less  than  15  %  of  deliveries 
are  too  late.  Test  the  hypothesis  p  <  0.15. 

(c)  The  manager  wonders  whether  there  is  any  relationship  between  the  operator 
taking  the  phone  call  and  the  pizza  temperature.  Assume  that  a  hot  pizza  is 
defined  to  be  one  with  a  temperature  greater  65  °C.  Use  the  test  of  Fisher,  the 
X2 -independence  test,  and  the  x2-test  of  homogeneity  to  test  his  hypothesis. 

(d)  Each  branch  employs  the  same  number  of  staff.  It  would  thus  be  desirable  if  each 
branch  receives  the  same  number  of  orders.  Use  an  appropriate  test  to  investigate 
this  hypothesis. 

(e)  Is  the  proportion  of  calls  taken  by  each  operator  the  same  in  each  branch? 

(f)  Test  whether  there  is  a  relationship  between  drivers  and  branches. 
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Exercise  10.12  The  authors  of  this  book  went  to  visit  historical  sites  in  India.  None  of 
them  has  a  particularly  strong  interest  in  photography,  and  they  speculated  that  each 
of  them  would  take  about  the  same  number  of  pictures  on  their  trip.  After  returning 
home,  they  counted  110,  118,  and  105  pictures,  respectively.  Use  an  appropriate  test 
to  find  out  whether  their  speculation  was  correct  (a  =  0.01). 


Solutions  to  all  exercises  in  this  chapter  can  be  found  on  p.  393 


"Source  Toutenburg,  H.,  Heumann,  C.,  Induktive  Statistik ,  4th  edition,  2007, 
Springer,  Heidelberg 
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We  learnt  about  various  measures  of  association  in  Chap.  4.  Such  measures  are 
used  to  understand  the  degree  of  relationship  or  association  between  two  variables. 
The  correlation  coefficient  of  Bravais-Pearson,  for  example,  can  help  us  to  quantify 
the  degree  of  a  linear  relationship,  but  it  does  not  tell  us  about  the  functional  form 
of  a  relationship.  Moreover,  any  association  of  interest  may  depend  on  more  than 
one  variable,  and  we  would  therefore  like  to  explore  multivariate  relationships.  For 
example,  consider  a  situation  where  we  measured  the  body  weights  and  the  long  jump 
results  (distances)  of  school  children  taking  part  in  a  competition.  The  correlation 
coefficient  of  Bravais-Pearson  between  the  body  weight  and  distance  jumped  can 
tell  us  how  strong  or  weak  the  association  between  the  two  variables  is.  It  can  also 
tell  us  whether  the  distance  jumped  increases  or  decreases  with  the  body  weight  in 
the  sense  that  a  negative  correlation  coefficient  indicates  shorter  jumps  for  higher 
body  weights,  whereas  a  positive  coefficient  implies  longer  jumps  for  higher  body 
weights.  Suppose  we  want  to  know  how  far  a  child  with  known  body  weight,  say 
50  kg,  will  jump.  Such  questions  cannot  be  answered  from  the  value  and  sign  of  the 
correlation  coefficient.  Another  question  of  interest  may  be  to  explore  whether  the 
relationship  between  the  body  weight  and  distance  jumped  is  linear  or  not.  One  may 
also  be  interested  in  the  joint  effect  of  age  and  body  weight  on  the  distance  jumped. 
Could  it  be  that  older  children,  who  have  on  average  a  higher  weight  than  younger 
children,  perform  better?  What  would  be  the  association  of  weight  and  the  long  jump 
results  of  children  of  the  same  age?  Such  questions  are  addressed  by  linear  regression 
analysis  which  we  introduce  in  this  chapter. 

In  many  situations,  the  outcome  does  not  depend  on  one  variable  but  on  several 
variables.  For  example,  the  recovery  time  of  a  patient  after  an  operation  depends  on 
several  factors  such  as  weight,  haemoglobin  level,  blood  pressure,  body  temperature, 
diet  control,  rehabilitation,  and  others.  Similarly,  the  weather  depends  on  many  fac¬ 
tors  such  as  temperature,  pressure,  humidity,  speed  of  winds,  and  others.  Multiple 
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(a)  Positive  linear  relationship  (b)  No  clear  relationship 

Fig.  11.1  Scatter  plots 


linear  regression,  introduced  in  Sect.  11.6,  addresses  the  issue  where  the  outcome 
depends  on  more  than  one  variable. 

To  begin  with,  we  consider  only  two  quantitative  variables  X  and  Y  in  which  the 
outcome  Y  depends  on  X  and  we  explore  the  quantification  of  their  relationship. 

Examples  Examples  of  associations  in  which  we  might  be  interested  in  are: 

•  body  height  (X)  and  body  weight  (7)  of  persons, 

•  speed  (X)  and  braking  distance  (7)  measured  on  cars, 

•  money  invested  (in  €)  in  the  marketing  of  a  product  (X)  and  sales  figures  for  this 
product  (in  €)  (7)  measured  in  various  branches, 

•  amount  of  fertilizer  used  (X)  and  yield  of  rice  (7)  measured  on  different  acres, 
and 

•  temperature  (X)  and  hotel  occupation  (7)  measured  in  cities. 


1 1 .1  The  Linear  Model 

Consider  the  scatter  plots  from  Fig.  4.2  on  p.  80.  Plotting  X-values  against  7-values 
enables  us  to  visualize  the  relationship  between  two  variables.  Figure  1 1.1a  reviews 
what  a  positive  (linear)  association  between  X  and  7  looks  like:  the  higher  the  X 
values,  the  higher  the  7 -values  (and  vice  versa).  The  plot  indicates  that  there  may 
be  a  linear  relationship  between  X  and  7.  On  the  other  hand,  Fig.  11.1b  displays  a 
scatter  plot  which  shows  no  clear  relationship  between  X  and  7.  The  R2  measure, 
shown  in  the  two  figures  and  explained  in  more  detail  later,  equates  to  the  squared 
correlation  coefficient  of  Bravais-Pearson. 
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To  summarize  the  observed  association  between  X  and  Y,  we  can  postulate  the 
following  linear  relationship  between  them: 

Y  =  ot  +  pX.  (11.1) 

This  Eq.  (11.1)  represents  a  straight  line  where  a  is  the  intercept  and  P  represents 
the  slope  of  the  line.  The  slope  indicates  the  change  in  the  Y -value  when  the  X-value 
changes  by  one  unit.  If  the  sign  of  P  is  positive,  it  indicates  that  the  value  of  Y 
increases  as  the  value  of  X  increases.  If  the  sign  of  P  is  negative,  it  indicates  that  the 
value  of  Y  decreases  as  the  value  of  X  increases.  When  X  =  0,  then  Y  =  a.  If  a  =  0, 
then  Y  =  pX  represents  a  line  passing  through  the  origin.  Suppose  the  height  and 
body  weights  in  the  example  of  school  children  are  represented  in  Fig.  11.1a.  This 
has  the  following  interpretation:  when  the  height  of  a  child  increases  by  1  cm,  then 
the  body  weight  increases  by  ft  kilograms.  The  slope  ft  in  Fig.  1 1.1a  would  certainly 
be  positive  because  we  have  a  positive  linear  relationship  between  X  and  Y.  Note 
that  in  this  example,  the  intercept  term  has  no  particular  meaning  because  when  the 
height  X  =  0,  the  body  weight  Y  =  a  =  0.  The  scatter  diagram  in  Fig.  11.1b  does 
not  exhibit  any  clear  relationship  between  X  and  Y.  Still,  the  slope  of  a  possible  line 
would  likely  be  somewhere  around  0. 

It  is  obvious  from  Fig.  11.1a  that  by  assuming  a  linear  relationship  between  X 
and  Y,  any  straight  line  will  not  exactly  match  the  data  points  in  the  sense  that  it 
cannot  pass  through  all  the  observations:  the  observations  will  lie  above  and  below 
the  line.  The  line  represents  a  model  to  describe  the  process  generating  the  data. 

In  our  context,  a  model  is  a  mathematical  representation  of  the  relationship 
between  two  or  more  variables.  A  model  has  two  components — variables  (e.g.  X,  Y) 
and  parameters  (e.g.  a,  P).  A  model  is  said  to  be  linear  if  it  is  linear  in  its  parameters. 
A  model  is  said  to  be  nonlinear  if  it  is  nonlinear  in  its  parameters  (e.g.  p2  instead 
of  P).  Now  assume  that  each  observation  potentially  deviates  by  ei  from  the  line  in 
Fig.  11.1a.  The  linear  model  in  Eq.  (11.1)  can  then  be  written  as  follows  to  take  this 
into  account: 

Y  =  a  +  px  +  e.  (11.2) 

Suppose  we  have  n  observations  (x\ ,  yi),  (x2,  y 2),  •  •  • ,  (xn,yn),  then  each  obser¬ 
vation  satisfies 

yi  =  a  +  p*i  +  et.  (11.3) 

Each  deviation  ei  is  called  an  error.  It  represents  the  deviation  of  the  data  points 
(xL ,  yt  )  from  the  regression  line.  The  line  in  Fig.  11.1a  is  the  fitted  (regression) 
line  which  we  will  discuss  in  detail  later.  We  assume  that  the  errors  ei  are  iden¬ 
tically  and  independently  distributed  with  mean  0  and  constant  variance  a2,  i.e. 
F (et)  =  0,  Var(£;)  =  o2  for  all  i  =  1,  2,  . . . ,  n.  We  will  discuss  these  assumptions 
in  more  detail  in  Sect.  1 1.7. 

In  the  model  (11.2),  Y  is  called  the  response,  response  variable,  dependent 
variable  or  outcome;  X  is  called  the  co variate,  regressor  or  independent  variable. 
The  scalars  a  and  p  are  the  parameters  of  the  model  and  are  described  as  regression 
coefficients  or  parameters  of  the  linear  model.  In  particular,  a  is  called  the  intercept 
term  and  P  is  called  the  slope  parameter. 
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It  may  be  noted  that  if  the  regression  parameters  a  and  /3  are  known,  then  the  linear 
model  is  completely  known.  An  important  objective  in  identifying  the  model  is  to 
determine  the  values  of  the  regression  parameters  using  the  available  observations 
for  X  and  Y.  It  can  be  understood  from  Fig.  11.1a  that  an  ideal  situation  would  be 
when  all  the  data  points  lie  exactly  on  the  line  or,  in  other  words,  the  error  et  is  zero 
for  each  observation.  It  seems  meaningful  to  determine  the  values  of  the  parameters 
in  such  a  way  that  the  errors  are  minimized. 

There  are  several  methods  to  estimate  the  values  of  the  regression  parameters.  In 
the  remainder  of  this  chapter,  we  describe  the  methods  of  least  squares  and  maximum 
likelihood  estimation. 


1 1 .2  Method  of  Least  Squares 

Suppose  n  sets  of  observations  Pi  =  (xi ,  yt),  i  =  1,  2, . . . ,  n,  are  obtained  on  two 
variables  P  =  ( X ,  T)  and  are  plotted  in  a  scatter  plot.  An  example  of  four  observa¬ 
tions,  (vi,  y i),  fa,  y2),  ( X3 ,  yj),  and  (*4,  y4),  is  given  in  Fig.  11.2. 

The  method  of  least  squares  says  that  a  line  can  be  fitted  to  the  given  data  set  such 
that  the  errors  are  minimized.  This  implies  that  one  can  determine  a  and  /3  such  that 
the  sum  of  the  squared  distances  between  the  data  points  and  the  line  Y  =  a  +  /3X 
is  minimized.  For  example,  in  Fig.  11.2,  the  first  data  point  (x\,  yi)  does  not  lie  on 
the  plotted  line  and  the  deviation  is  e\  =  y\  —  (a  +  fix i).  Similarly,  we  obtain  the 


Fig.  1 1 .2  Regression  line,  observations,  and  errors  e* 
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difference  of  the  other  three  data  points  from  the  line:  ei,  £3,  The  error  is  zero  if 
the  point  lies  exactly  on  the  line.  The  problem  we  would  like  to  solve  in  this  example 
is  to  minimize  the  sum  of  squares  of  e\ ,  £2,  £3,  and  £4,  i.e. 

4 

min  V(>v  -  a  -  fixi )2 .  ( 1 1 .4) 

ot.fi  . 

l  —  \ 


We  want  the  line  to  fit  the  data  well.  This  can  generally  be  achieved  by  choosing 
a  and  ft  such  that  the  squared  errors  of  all  the  n  observations  are  minimized: 


n 


n 


min  f  ef  =  min  ^7y/  —  a  —  /3x/)' 

/  — 1  i  —  \ 


(11.5) 


If  we  solve  this  optimization  problem  by  the  principle  of  maxima  and  minima,  we 
obtain  estimates  of  a  and  ft  as 


/V 


>xx 


Xfo -*)(#-?)  _  T!i=\*j  yi-nxy 


z; 


=  1  X  i 


-nx< 


a  =  y  —  bx 


(11.6) 


see  Appendix  C.6  for  a  detailed  derivation.  Here,  a  and  ft  represent  the  estimates  of 
the  parameters  a  and  ft ,  respectively,  and  are  called  the  least  squares  estimator  of 

_  A 

a  and  /3 ,  respectively.  This  gives  us  the  model  y  =  a  +  /3x  which  is  called  the  fitted 
model  or  the  fitted  regression  line .  The  literal  meaning  of  “regression”  is  to  move 
back.  Since  we  are  acquiring  the  data  and  then  moving  back  to  find  the  parameters 
of  the  model  using  the  data,  it  is  called  a  regression  model.  The  fitted  regression 
line  y  =  a  +  fix  describes  the  postulated  relationship  between  Y  and  X.  The  sign 
of  determines  whether  the  relationship  between  X  and  Y  is  positive  or  negative. 
If  the  sign  of  ft  is  positive,  it  indicates  that  if  X  increases,  then  Y  increases  too.  On 
the  other  hand,  if  the  sign  of  ft  is  negative,  it  indicates  that  if  X  increases,  then  Y 
decreases.  For  any  given  value  of  X,  say  jtj,  the  predicted  value  y*  is  calculated  by 

yt=a  + 


and  is  called  the  i th  fitted  value. 

If  we  compare  the  observed  data  point  (x; ,  y; )  with  the  point  suggested  (predicted, 
fitted)  by  the  regression  line,  (x;,  y;),  the  difference  between  y;  and  y;  is  called  the 
residual  and  is  given  as 

h  —  yt  -  yt  =  yt  -  (a  +  /3x;).  (11.7) 


This  can  be  viewed  as  an  estimator  of  the  error  et .  Note  that  it  is  not  really  an  estimator 
in  the  true  statistical  sense  because  e*  is  random  and  so  it  cannot  be  estimated. 
However,  since  the  ef  are  unknown  and  Si  measures  the  same  difference  between  the 
estimated  and  true  values  of  the  y’s,  see  for  example  Fig.  11.2,  it  can  be  treated  as 
estimating  the  error  ^ . 
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Example  11.2.1  A  physiotherapist  advises  12  of  his  patients,  all  of  whom  had  the 
same  knee  surgery  done,  to  regularly  perform  a  set  of  exercises.  He  asks  them  to 
record  how  long  they  practise.  He  then  summarizes  the  average  time  they  practised 
( X ,  time  in  minutes)  and  how  long  it  takes  them  to  regain  their  full  range  of  motion 
again  (7,  time  in  days).  The  results  are  as  follows: 


i 

1 

2 

3 

4 

5 

6 

7 

8 

9 

10 

11 

12 

Xi 

24 

35 

64 

20 

33 

27 

42 

41 

22 

50 

36 

31 

yt 

90 

65 

30 

60 

60 

80 

45 

45 

80 

35 

50 

45 

/V 

To  estimate  the  linear  regression  line  y  =  a  +  fix,  we  first  calculate  x  =  35.41 
and  y  =  57.08.  To  obtain  Sxy  and  Sxx  we  need  the  following  table: 


i 

Xi 

yt 

(xi  -  x) 

<j;  -  y) 

(Xi  -  x)(yt  -  y) 

(Xi  -  x)2 

1 

24 

90 

-11.41 

32.92 

-375.61 

130.19 

2 

35 

65 

-0.41 

7.92 

-3.25 

0.17 

3 

64 

30 

28.59 

-27.08 

-774.22 

817.39 

4 

20 

60 

-15.41 

2.92 

-45.00 

237.47 

5 

33 

60 

-2.41 

2.92 

-7.27 

5.81 

6 

27 

80 

-8.41 

22.92 

-192.75 

70.73 

7 

42 

45 

6.59 

-12.08 

-79.61 

43.43 

8 

41 

45 

5.59 

-12.08 

-67.53 

31.25 

9 

22 

80 

-13.41 

22.92 

-307.36 

179.83 

10 

50 

35 

14.59 

-22.08 

-322.14 

212.87 

11 

36 

50 

0.59 

-7.08 

-4.18 

0.35 

12 

31 

45 

-4.41 

-12.08 

53.27 

19.45 

Total 

-2125.65 

1748.94 

Using  (11.6),  we  can  now  easily  find  the  least  squares  estimates  a  and  /3  as 
2  =  ?*y_=  Z (xj-x)(yi-y)  =  -2125.65  _  22 

P  Sxx  Z(*/-*)2  1748.94  ~  '  ’ 

a  =  y  -  fix  =  57.08  -  (-1.215)  •  35.41  =  100.28. 


The  fitted  regression  line  is  therefore 

y  =  100.28  —  1.22  •  x. 
We  can  interpret  the  results  as  follows: 


•  For  an  increase  of  1  min  in  exercising,  the  recovery  time  decreases  by  1.22  days 

a  _ 

because  /3  =  —  1 .22.  The  negative  sign  of  /3  indicates  that  the  relationship  between 
exercising  time  and  recovery  time  is  negative;  i.e.  as  exercise  time  increases,  the 
recovery  time  decreases. 

•  When  comparing  two  patients  with  a  difference  in  exercising  time  of  10  min, 
the  linear  model  estimates  a  difference  in  recovery  time  of  12.2  days  because 
10  •  1.22  =  12.2. 


11.2  Method  of  Least  Squares 
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Fig.  11.3  Scatter  plot  and 
regression  line  for 
Example  11.2.1 


Exercise  Time 


•  The  model  predicts  an  average  recovery  time  of  100.28—1.22  •  38  =  53.92  days 
for  a  patient  who  practises  the  set  of  exercises  for  38  min. 

•  If  a  patient  did  not  exercise  at  all,  the  recovery  time  would  be  predicted  as  a  = 
100.28  days  by  the  model.  However,  see  item  (i)  in  Sect.  11.2.1  below  about 
interpreting  a  regression  line  outside  of  the  observed  data  range. 

We  can  also  obtain  these  results  by  using  R.  The  command  lm(Y~X)  fits  a  linear 

✓V 

model  and  provides  the  estimates  of  a  and  . 

lm(Y~X)  R 

We  can  draw  the  regression  line  onto  a  scatter  plot  using  the  command  abline, 
see  also  Fig.  11.3. 

plot (X , Y)  R 

abline (a=100 . 28,b=-l . 22) 


1 1 .2.1  Properties  of  the  Linear  Regression  Line 

There  are  a  couple  of  interesting  results  related  to  the  regression  line  and  the  least 
square  estimates. 


A 

(i)  As  a  rule  of  thumb,  one  should  interpret  the  regression  line  yt  =  a  +  fixi  only  in 
the  interval  [v(i),  X(n)].  For  example,  if  X  denotes  “Year”,  with  observed  values 
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from  1999  to  2015,  and  Y  denotes  the  “annual  volume  of  sales  of  a  particular 
company”,  then  a  prediction  for  the  year  2030  may  not  be  meaningful  or  valid 
because  a  linear  relationship  discovered  in  the  past  may  not  continue  to  hold  in 
the  future. 

/V 

(ii)  For  the  points  Pi  =  (v/,  y*),  forming  the  regression  line,  we  can  write 

9i  =  &  +  p*i  =  y  +  P(xi  -  x).  (11.8) 

(iii)  It  follows  for  Xi  =  x  that  yt  =  y,  i.e.  the  point  (x,  y)  always  lies  on  the  regres¬ 
sion  line.  The  linear  regression  line  therefore  always  passes  through  (x,  y). 

(iv)  The  sum  of  the  residuals  is  zero.  The  i th  residual  is  defined  as 

h  =  yt  -  yt  =  yt  -  (a  +  px{)  =  yt  -  [y  +  P(xt  -  x)]. 


The  sum  is  therefore 


n 


n 


n 


n 


X/'  =  Ylyi  _  ^ 

i  =  1  i= 1  i=l  i  =  1 

—  ny  —  ny  —  /3(nx  —  nx)  =  0. 

(v)  The  arithmetic  mean  of  y  is  the  same  as  the  arithmetic  mean  of  y\ 


(11.9) 


-  1  ^  ,  1 

y  =  -  X yt  =  ~(ny+P(nx  -  nx))  =  y. 
n  n 

i  =  1 


(11.10) 


(vi)  The  least  squares  estimate  /3  has  a  direct  relationship  with  the  correlation  coef¬ 
ficient  of  Bravais-Pearson: 


’XX 


>xy 


yy 


}yy 


’XX 


=  r 


}yy 


’XX 


(11.11) 


The  slope  of  the  regression  line  is  therefore  proportional  to  the  correlation  coef¬ 
ficient  r :  a  positive  correlation  coefficient  implies  a  positive  estimate  of  ft  and 
vice  versa.  However,  a  stronger  correlation  does  not  necessarily  imply  a  steeper 
slope  in  the  regression  analysis  because  the  slope  depends  upon  JSyy/ Sxx  as 
well. 


1 1 .3  Goodness  of  Fit 

While  one  can  easily  fit  a  linear  regression  model,  this  does  not  mean  that  the  model 
is  necessarily  good.  Consider  again  Fig.  11.1:  In  Fig.  11.1a,  the  regression  line  is 
clearly  capturing  the  linear  trend  seen  in  the  scatter  plot.  The  fit  of  the  model  to  the 
data  seems  good.  Figure  11.1b  shows  however  that  the  data  points  vary  considerably 
around  the  line.  The  quality  of  the  model  does  not  seem  to  be  very  good.  If  we  would 
use  the  regression  line  to  predict  the  data,  we  would  likely  obtain  poor  results.  It 
is  obvious  from  Fig.  11.1  that  the  model  provides  a  good  fit  to  the  data  when  the 
observations  are  close  to  the  line  and  capture  a  linear  trend. 


11.3  Goodness  of  Fit 
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(a)  Strong  linear  relationship  (b)  Weak  linear  relationship 

Fig.  11.4  Different  goodness  of  fit  for  different  data 


A  look  at  the  scatter  plot  provides  a  visual  and  qualitative  approach  to  judging 
the  quality  of  the  fitted  model.  Consider  Fig.  1 1.4a,  b  which  both  show  a  linear  trend 
in  the  data,  but  the  observations  in  Fig.  11.4a  are  closer  to  the  regression  line  than 
those  in  Fig.  1 1.4b.  Therefore,  the  goodness  of  fit  is  worse  in  the  latter  figure  and  any 
quantitative  measure  should  capture  this. 

A  quantitative  measure  for  the  goodness  of  fit  can  be  derived  by  means  of  vari¬ 
ance  decomposition  of  the  data.  The  total  variation  of  y  is  partitioned  into  two 
components — sum  of  squares  due  to  the  fitted  model  and  sum  of  squares  due  to 
random  errors  in  the  data: 

n  n  n 

-  y )2  =  yk>v  -  y )2  +  yk>v  -  9i )2  •  (11.12) 

i= 1  i= 1  i= 1 

^  ..  ^  ..  ^ 

S  0  Total  S  ^Regression  S  Q  Error 

The  proof  of  the  above  equation  is  given  in  Appendix  C.6. 

The  left-hand  side  of  (11.12)  represents  the  total  variability  in  y  with  respect  to 
y.  It  is  proportional  to  the  sample  variance  (3.21)  and  is  called  SQxotai  (total  sum  of 
squares).  The  first  term  on  the  right-hand  side  of  the  equation  measures  the  variability 
which  is  explained  by  the  fitted  linear  regression  model  ( S  <2 Regression,  sum  of  squares 
due  to  regression);  the  second  term  relates  to  the  error  sum  of  squares  ( S gError)  and 
reflects  the  variation  due  to  random  error  involved  in  the  fitted  model.  Larger  values 
of  S  (2 Error  indicate  that  the  deviations  of  the  observations  from  the  regression  line  are 
large.  This  implies  that  the  model  provides  a  bad  fit  to  the  data  and  that  the  goodness 
of  fit  is  low. 

Obviously,  one  would  like  to  have  a  fit  in  which  the  error  sum  of  squares  is  as 
small  as  possible.  To  judge  the  goodness  of  fit,  one  can  therefore  look  at  the  error 
sum  of  squares  in  relation  to  the  total  sum  of  squares:  in  an  ideal  situation,  if  the  error 
sum  of  squares  equals  zero,  the  total  sum  of  squares  is  equal  to  the  sum  of  squares 
due  to  regression  and  the  goodness  of  fit  is  optimal.  On  the  other  hand,  if  the  sum  of 
squares  due  to  error  is  large,  it  will  make  the  sum  of  squares  due  to  regression  smaller 
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and  the  goodness  of  fit  should  be  bad.  If  the  sum  of  squares  due  to  regression  is  zero, 
it  is  evident  that  the  model  fit  is  the  worst  possible.  These  thoughts  are  reflected  in 
the  criterion  for  the  goodness  of  fit,  also  known  as  R 2: 


R 


2 


SQr 


egression 


3  Q Total 


=  1  - 


SQe 


t  rror 


3  Q  Total 


(11.13) 


It  follows  from  the  above  definition  that  0  <  R2  <  1.  The  closer  R2  is  to  1,  the  better 
the  fit  because  S^Error  will  then  be  small.  The  closer  R2  is  to  0,  the  worse  the  fit, 
because  S^Error  will  then  be  large.  If  R2  takes  any  other  value,  say  R2  =  0.7,  it 
means  that  only  70  %  of  the  variation  in  data  is  explained  by  the  fitted  model,  and 
hence,  in  simple  language,  the  model  is  70  %  good.  An  important  point  to  remember 
is  that  R2  is  defined  only  when  there  is  an  intercept  term  in  the  model  (an  assumption 
we  make  throughout  this  chapter).  So  it  is  not  used  to  measure  the  goodness  of  fit  in 
models  without  an  intercept  term. 


Example  11.3.1  Consider  again  Fig.  11.1:  In  Fig.  11.1a,  the  line  and  data  points  fit 
well  together.  As  a  consequence  R2  is  high,  R2  =  0.82.  Figure  11.1b  shows  data 
points  with  large  deviations  from  the  regression  line;  therefore,  R2  is  small,  here 
R2  =  0.002.  Similarly,  in  Fig.  11.4a,  an  R2  of  0.97  relates  to  an  almost  perfect 
model  fit,  whereas  in  Fig.  11.4b,  the  model  describes  the  data  only  moderately  well 
( R 2  =  0.57). 


Example  11.3.2  Consider  again  Example  1 1.2.1  where  we  analysed  the  relationship 
between  exercise  intensity  and  recovery  time  for  patients  convalescing  from  knee 
surgery.  To  calculate  R 2,  we  need  the  following  table: 


i 

yi 

/V 

yi 

(yi  -  y ) 

(: yi  -  yr 

(yi  -  y) 

(y>  -  yY 

1 

90 

70.84 

32.92 

1083.73 

13.76 

189.34 

2 

65 

57.42 

7.92 

62.73 

0.34 

0.12 

3 

30 

22.04 

-27.08 

733.33 

-35.04 

1227.80 

4 

60 

75.72 

2.92 

8.53 

18.64 

347.45 

5 

60 

59.86 

2.92 

8.53 

2.78 

7.73 

6 

80 

67.18 

22.92 

525.33 

10.10 

102.01 

7 

45 

48.88 

-12.08 

145.93 

-8.2 

67.24 

8 

45 

50.10 

-12.08 

145.93 

-6.83 

48.72 

9 

80 

73.28 

22.92 

525.33 

16.20 

262.44 

10 

35 

39.12 

-22.08 

487.53 

-17.96 

322.56 

11 

50 

56.20 

-7.08 

50.13 

-0.88 

0.72 

12 

45 

62.30 

-12.08 

145.93 

5.22 

27.25 

Total 

3922.96 

2603.43 

We  calculate  R 2  with  these  results  as 


R 


2 


SQr 


egression 


Z klZi 

XU  (yt 


yf 

y )2 


2603.43 

3922.96 


0.66. 


3  Q  Total 
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We  conclude  that  the  regression  model  provides  a  reasonable  but  not  perfect  fit  to 
the  data  because  0.66  is  not  close  to  0,  but  also  not  very  close  to  1 .  About  66  %  of  the 
variability  in  the  data  can  be  explained  by  the  fitted  linear  regression  model.  The  rest 
is  random  error:  for  example,  individual  variation  in  the  recovery  time  of  patients 
due  to  genetic  and  environmental  factors,  surgeon  performance,  and  others. 

We  can  also  obtain  this  result  in  R  by  looking  at  the  summary  of  the  linear  model: 

summar y  ( lm  ( Y^X )  ) 


Please  note  that  we  give  a  detailed  explanation  of  the  model  summary  in  Sect.  11.7. 

There  is  a  direct  relationship  between  R 2  and  the  correlation  coefficient  of 
Bravais-Pearson  r: 


R2  =  r2  = 


>xy 


■\J $xx  Syy 


(11.14) 


The  proof  is  given  in  Appendix  C.6. 


Example  11.3.3  Consider  again  Examples  11.3  and  11.5  where  we  analysed  the 
association  of  exercising  time  and  time  to  regain  full  range  of  motion  after  knee 
surgery.  We  calculated  R 2  =  0.66.  We  therefore  know  that  the  correlation  coefficient 
is  r  =  =  \/0.66  ~  0.81. 


1 1 .4  Linear  Regression  with  a  Binary  Covariate 

Until  now,  we  have  assumed  that  the  covariate  X  is  continuous.  It  is  however  also 
straightforward  to  fit  a  linear  regression  model  when  X  is  binary,  i.e.  if  X  has  two 
categories.  In  this  case,  the  values  of  X  in  the  first  category  are  usually  coded  as 
0  and  the  values  of  X  in  the  second  category  are  coded  as  1.  For  example,  if  the 
binary  variable  is  “gender”,  we  replace  the  word  “male”  with  the  number  0  and 
the  word  “female”  with  1.  We  can  then  fit  a  linear  regression  model  using  these 
numbers,  but  the  interpretation  differs  from  the  interpretations  in  case  of  a  continuous 
variable.  Recall  the  definition  of  the  linear  model,  Y  =  a  +  /3X  +  e  with  E(^)  =  0; 
if  X  =  0  (male)  then  E(T|X  =  0)  =  a  and  if  X  =  1  (female),  then  E(T|X  =  1)  = 
a  +  /3-l=a  +  l3.  Thus,  a  is  the  average  value  of  Y  for  males,  i.e.  E(Y\X  =  0), 
and/3  =  E(Y\X  =  1)  —  E(Y\X  =  0).  It  follows  that  those  subjects  with  X  =  1  (e.g. 
females)  have  on  average  Y -values  which  are  /3  units  higher  than  subjects  with  X  =  0 
(e.g.  males). 
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Example  11.4.1  Recall  Examples  11.2.1,  11.3.2,  and  11.3.3  where  we  analysed  the 
association  of  exercising  time  and  recovery  time  after  knee  surgery.  We  keep  the 
values  of  Y  (recovery  time,  in  days)  and  replace  values  of  X  (exercising  time,  in 
minutes)  with  0  for  patients  exercising  for  less  than  40  min  and  with  1  for  patients 
exercising  for  40  min  or  more.  We  have  therefore  a  new  variable  X  indicating  whether 
a  patient  is  exercising  a  lot  (X  =  1)  or  not  ( X  =  0).  To  estimate  the  linear  regression 
line  y  =  a  +  fix,  we  first  calculate  v  =4/12  and  y  =  57.08.  To  obtain  Sxy  and  Sxx , 
we  need  the  following  table: 


i 

Xi 

yt 

fa  -  x ) 

(yt  -  y) 

(xi  —  x)(yt  -  y ) 

(xi  -  x)2 

1 

0 

90 

4 

12 

32.92 

-10.97 

0.11 

2 

0 

65 

4 

12 

7.92 

-2.64 

0.11 

3 

1 

30 

8 

12 

-27.08 

-18.05 

0.44 

4 

0 

60 

4 

12 

2.92 

-0.97 

0.11 

5 

0 

60 

4 

12 

2.92 

-0.97 

0.11 

6 

0 

80 

4 

12 

22.92 

-7.64 

0.11 

7 

1 

45 

8 

12 

-12.08 

-8.05 

0.44 

8 

1 

45 

8 

12 

-12.08 

-8.05 

0.44 

9 

0 

80 

4 

12 

22.92 

-7.64 

0.11 

10 

1 

35 

8 

12 

-22.08 

-14.72 

0.44 

11 

0 

50 

4 

12 

-7.08 

2.36 

0.11 

12 

0 

45 

4 

12 

-12.08 

4.03 

0.11 

Total 

Total 

-72.34 

2.64 

We  can  now  calculate  the  least  squares  estimates  of  a  and  /3  using  (11.6)  as 


}xy 


•XX 


Zfe  -  x)(yi  -  y)  -72.34 


Z(*/  -  x) 


2.64 


-27.4, 


a  =  y  —  /3x  =  57.08  -  (-27.4)  •  —  =  66.2 


The  fitted  regression  line  is: 


The  interpretation  is: 


12 


y  =  66.2  —  27.4  •  x. 


•  The  average  recovery  time  for  patients  doing  little  exercise  is  66.2  —  27.4  •  0  = 
66.2  days. 

•  Patients  exercising  heavily  (x  =  1)  have,  on  average,  a  27.4  days  shorter  recovery 
period  (/3  =  —27.4)  than  patients  with  a  short  exercise  time  (x  =  0). 

•  The  average  recovery  time  for  patients  exercising  heavily  is  38.8  days  (66.2  — 
27.4  •  1  for  v  =  1). 

•  These  interpretations  are  visualized  in  Fig.  11.5.  Because  “exercise  time”  (X) 
is  considered  to  be  a  nominal  variable  (two  categories  representing  the  intervals 
[0;  40)  and  (40;  oo)),  we  can  plot  the  regression  line  on  a  scatter  plot  as  two  parallel 
lines. 


1 1 .5  Linear  Regression  with  a  Transformed  Covariate 
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Fig.  11.5  Scatter  plot  and 
regression  lines  for 
Examples  11.4.1  and  11.5.1 
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1 1 .5  Linear  Regression  with  a  Transformed  Covariate 

Recall  that  a  model  is  said  to  be  linear  when  it  is  linear  in  its  parameters.  The  definition 
of  a  linear  model  is  not  at  all  related  to  the  linearity  in  the  covariate.  For  example, 

Y  =  a  +  p2X  +  e 

is  not  a  linear  model  because  the  right-hand  side  of  the  equation  is  not  a  linear 
function  in  /3.  However, 

Y  =  a  +  PX2  +  e 

is  a  linear  model.  This  model  can  be  fitted  as  for  any  other  linear  model:  we  obtain 

/V 

a  and  /3  as  usual  and  simply  use  the  squared  values  of  X  instead  of  X.  This  can 
be  justified  by  considering  X*  =  X 2,  and  then,  the  model  is  written  as  Y  =  a  + 
/3X*  +  e  which  is  again  a  linear  model.  Only  the  interpretation  changes:  for  each 
unit  increase  in  the  squared  value  of  X,  i.e.  X*,  Y  increases  by  /3  units.  Such  an 
interpretation  is  often  not  even  needed.  One  could  simply  plot  the  regression  line  of 
Y  on  X*  to  visualize  the  functional  form  of  the  effect.  This  is  also  highlighted  in  the 
following  example. 

Example  11.5.1  Recall  Examples  11.2.1,  11.3.2,  11.3.3,  and  11.4.1  where  we 
analysed  the  association  of  exercising  time  and  recovery  time  after  knee  surgery.  We 
estimated  /3  as  —1.22  by  using  X  as  it  is,  as  a  continuous  variable,  see  also  Fig.  1 1.3. 
When  using  a  binary  X,  based  on  a  cut-off  of  40 min,  we  obtained  /3  =  —27.4,  see 
also  Fig.  1 1.5.  If  we  now  use  \fx  rather  than  X,  we  obtain /3  =  —15.1.  This  means  for 
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an  increase  of  1  unit  of  the  square  root  of  exercising  time,  the  recovery  time  decreases 
by  15.1  days.  Such  an  interpretation  will  be  difficult  to  understand  for  many  people. 
It  is  better  to  plot  the  linear  regression  line  y  =  145.8  —  15.1  •  see  Fig.  11.5.  We 
can  see  that  the  new  nonlinear  line  (obtained  from  a  linear  model)  fits  the  data  nicely 
and  it  may  even  be  preferable  to  a  straight  regression  line.  Moreover,  the  value  of  a 
substantially  increased  with  this  modelling  approach,  highlighting  that  no  exercis¬ 
ing  at  all  may  severely  delay  recovery  from  the  surgery  (which  is  biologically  more 
meaningful).  In  R,  we  obtain  these  results  by  either  creating  a  new  variable  \fx  or 
by  using  the  I  ( )  command  which  allows  specifying  transformations  in  regression 
models. 

newX  <-  sqrt(X) 

lm(Y^newX)  #option  1 

lm(Y~I  (sqrt  (X) ) )  #option  2 

Generally  the  covariate  X  can  be  replaced  by  any  transformation  of  it,  such  as 
using  log(X),  +/X,  sin(X),  or  X2.  The  details  are  discussed  in  Sect.  11.6.3.  It  is 
also  possible  to  compare  the  quality  and  goodness  of  fit  of  models  with  different 
transformations  (see  Sect.  11.8). 


1 1 .6  Linear  Regression  with  Multiple  Covariates 

Up  to  now,  we  have  only  considered  two  variables — Y  and  X.  We  therefore  assume 
that  Y  is  affected  by  only  one  covariate  X.  However,  in  many  situations,  there  might 
be  more  than  one  covariate  which  affects  Y  and  consequently  all  of  them  are  relevant 
to  the  analysis  (see  also  Sect.  1 1.10  on  the  difference  between  association  and  causa¬ 
tion).  For  example,  the  yield  of  a  crop  depends  on  several  factors  such  as  quantity  of 
fertilizer,  irrigation,  and  temperature,  among  others.  In  another  example,  in  the  pizza 
data  set  (described  in  Appendix  A.4),  many  variables  could  potentially  be  associated 
with  delivery  time — driver,  amount  of  food  ordered,  operator  handling  the  order, 
whether  it  is  a  weekend,  and  so  on.  A  reasonable  question  to  ask  is:  Do  different 
drivers  have  (on  average)  a  different  delivery  time — given  they  deal  with  the  same 
operator,  the  amount  of  food  ordered  is  the  same,  the  day  is  the  same,  etc.  These  kind 
of  questions  can  be  answered  with  multiple  linear  regression.  The  model  contains 
more  than  one,  say  p ,  covariates  X\,  X2, . . . ,  Xp.  The  linear  model  (11.2)  can  be 
extended  as 

Y  =  fio  +  fi\X\  +  •  •  •  +  PpXp  +  e.  (11.15) 

Note  that  the  intercept  term  is  denoted  here  by  fo.  In  comparison  with  (11 .2),  a  =  fio 
and  ft  =  ft\. 
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Example  11.6.1  For  the  pizza  delivery  data,  a  particular  linear  model  with  multiple 
covariates  could  be  specified  as  follows: 

Delivery  Time  =  0o  +  0i Number  pizzas  ordered  +  02Weekend(O/l) 

+03  Operator  (0/1)  +  e. 


1 1 .6.1  Matrix  Notation 

If  we  have  a  data  set  of  n  observations,  then  every  set  of  observations  satisfies  model 
(11.15)  and  we  can  write 

yi  =  0o  +  0i*n  +  02*12  +  •  •  •  +  PpX\P  +^i 

J2  =  00  +  01*21  +  02*22  H - h  PpX2p  +e2 

yn  =  0o  +  01*72 1  +  02 *22 2  +  •  •  '  +  fipXnp  -\-en 

It  is  possible  to  write  the  n  equations  in  (11.16)  in  matrix  notation  as 

y  =  X0  +  e 

where 


/-Vl\ 

/  1  X]  1  X\2  ■■■  XI p\ 

( M 

J2 

,  x  = 

1  X21  X22  ■■■  XI p 

,  fi  = 

a 

,  e  = 

e2 

\  yn  ) 

\  1  Xn\  Xn2  '  •  '  Xnp  J 

Kfip ) 

Ken  ) 

The  letters  y,  X,  0,  e  are  written  in  bold  because  they  refer  to  vectors  and  matrices 
rather  than  scalars.  The  capital  letter  X  makes  it  clear  that  X  is  a  matrix  of  order 
n  x  p  representing  the  n  observations  on  each  of  the  covariates  X\,  X2,  . . . ,  Xp. 
Similarly,  y  is  a  n  x  1  vector  of  n  observations  on  Y,  0  is  a  p  x  1  vector  of  regression 
coefficients  associated  with  X\ ,  X2,  . . . ,  Xp,  and  e  is  a  n  x  1  vector  of  n  errors.  The 
lower  case  letter  x  relates  to  a  vector  representing  a  variable  which  means  we  can 
denote  the  multiple  linear  model  from  now  on  also  as 

y  =  0o  1  +  0ixi  +  •  •  •  +  0/?xp  +  e  (11.18) 

where  1  is  the  n  x  1  vector  of  l’s.  We  assume  that  E(e)  =  0  and  Cov(e)  =  a2 1  (see 
Sect.  11.9  for  more  details). 

We  would  like  to  highlight  that  X  is  not  the  data  matrix.  The  matrix  X  is  called 
the  design  matrix  and  contains  both  a  column  of  l’s  denoting  the  presence  of  the 
intercept  term  and  all  explanatory  variables  which  are  relevant  to  the  multiple  linear 
model  (including  possible  transformations  and  interactions,  see  Sects.  11.6.3  and 
11.7.3).  The  errors  e  reflect  the  deviations  of  the  observations  from  the  regression 
line  and  therefore  the  difference  between  the  observed  and  fitted  relationships.  Such 
deviations  may  occur  for  various  reasons.  For  example,  the  measured  values  can  be 


(11.16) 

(11.17) 


264 


11  Linear  Regression 


affected  by  variables  which  are  not  included  in  the  model,  the  variables  may  not  be 
accurately  measured,  there  is  unmeasured  genetic  variability  in  the  study  population, 
and  all  of  which  are  covered  in  the  error  e.  The  estimate  of  p  is  obtained  by  using  the 
least  squares  principle  by  minimizing  X?=i  ef  —  e  e-  The  least  squares  estimate 
of  P  is  given  by 

j)  =  (X'Xr'X'y.  (11.19) 

/V 

The  vector  ft  contains  the  estimates  of  (Po,  P\, . . . ,  pp)' .  We  can  interpret  it  as 

/V 

earlier:  Po  is  the  estimate  of  the  intercept  which  we  obtain  because  we  have  added  the 

/V  /V  /V 

column  of  l’s  (and  is  identical  to  a  in  (11.1)).  The  estimates  P\,  @2,  •  •  • ,  ftp  refer  to 
the  regression  coefficients  associated  with  the  variables  xi,  X2, . . . ,  xp,  respectively. 

A 

The  interpretation  of  pj  is  that  it  represents  the  partial  change  in  y*  when  the  value 
of  Xi  changes  by  one  unit  keeping  all  other  covariates  fixed. 

A  possible  interpretation  of  the  intercept  term  is  that  when  all  covariates  equal 
zero  then 

E(y)  =  Po  +  Pi  •  0  +  •  •  •  +  Pp  •  0  =  p0.  (11.20) 

There  are  many  situations  in  real  life  for  which  there  is  no  meaningful  interpretation 
of  the  intercept  term  because  one  or  many  covariates  cannot  take  the  value  zero.  For 
instance,  in  the  pizza  data  set,  the  bill  can  never  be  €0,  and  thus,  there  is  no  need 
to  describe  the  average  delivery  time  for  a  given  bill  of  €0.  The  intercept  term  here 
serves  the  purpose  of  improvement  of  the  model  fit,  but  it  will  not  be  interpreted. 

In  some  situations,  it  may  happen  that  the  average  value  of  y  is  zero  when  all 
co variates  are  zero.  Then,  the  intercept  term  is  zero  as  well  and  does  not  improve 
the  model.  For  example,  suppose  the  salary  of  a  person  depends  on  two  factors — 
education  level  and  type  of  qualification.  If  any  person  is  completely  illiterate,  even 
then  we  observe  in  practice  that  his  salary  is  never  zero.  So  in  this  case,  there  is  a 
benefit  of  the  intercept  term.  In  another  example,  consider  that  the  velocity  of  a  car 
depends  on  two  variables — acceleration  and  quantity  of  petrol.  If  these  two  variables 
take  values  of  zero,  the  velocity  is  zero  on  a  plane  surface.  The  intercept  term  will 
therefore  be  zero  as  well  and  yields  no  model  improvement. 

Example  11.6.2  Consider  the  pizza  data  described  in  Appendix  A.4.  The  data  matrix 


A  is  as  follows: 

Day 

Date 

Time  •  •  • 

Discount 

/  Thursday 

1 -May- 14 

35.1  ••• 

1  \ 

Thursday 

1 -May- 14 

25.2  •  •  • 

0 

A  = 

y  Saturday 

31 -May- 14 

35.7  ... 

0  J 

Suppose  the  manager  has  the  hypothesis  that  the  operator  and  the  overall  bill  (as 
a  proxy  for  the  amount  ordered  from  the  customer)  influence  the  delivery  time.  We 
can  postulate  a  linear  model  to  describe  this  relationship  as 

Delivery  Time  =  /3q  +  ftBill  +  p2 Operator  +  e. 
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The  model  in  matrix  notation  is  as  follows: 


/  35.1  \ 
25.2 


/I  58.4  1\ 
1  26.4  0  ' 


\35.7  /  \  1  42.7  0/ 

y 


X 


To  understand  the  associations  in  the  data,  we  need  to  estimate  p  which  we  obtain  by 
the  least  squares  estimator  ft  =  (X'X)-AX'y.  Rather  than  doing  this  tiresome  task 
manually,  we  use  R : 


lm(time~bill+operator) 


If  we  have  more  than  one  covariate  in  a  linear  regression  model,  we  simply  add 
all  of  them  separated  by  the  +  sign  when  specifying  the  model  formula  in  the  lm  ( ) 

AAA 

function.  We  obtain  fio  =  23.1,  p\  =  0.26,  ft 2  =  0.16.  The  interpretation  of  these 
parameters  is  as  follows: 

•  For  each  extra  €  that  is  spent,  the  delivery  time  increases  by  0.26  min.  Or,  for  each 
extra  €10  spent,  the  delivery  time  increases  on  average  by  2.6  min. 

•  The  delivery  time  is  on  average  0.16  min  longer  for  operator  1  compared  with 
operator  0. 

•  For  operator  0  and  a  bill  of  €0,  the  expected  delivery  time  is  /3o  =  23.1  min. 
However,  there  is  no  bill  of  €0,  and  therefore,  the  intercept  Pq  cannot  be  interpreted 
meaningfully  here  and  is  included  only  for  improvement  of  the  model  fit. 


1 1 .6.2  Categorical  Covariates 

We  now  consider  the  case  when  covariates  are  categorical  rather  than  continuous. 
Examples 

•  Region:  East,  West,  South,  North, 

•  Marital  status:  single,  married,  divorced,  widowed, 

•  Day:  Monday,  Tuesday,  . . .,  Sunday. 

We  have  already  described  how  to  treat  a  categorical  variable  which  consists  of 
two  categories,  i.e.  a  binary  variable:  one  category  is  replaced  by  l’s  and  the  other 
category  is  replaced  by  0’s  subjects  who  belong  to  the  category  x  =  l  have  on  average 
y -values  which  are  ft  units  higher  than  those  subjects  with  v  =  0. 
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Consider  a  variable  v  which  has  more  than  two  categories,  say  k  >  2  categories.  To 
include  such  a  variable  in  a  regression  model,  we  create  k  —  1  new  variables  Xi ,  i  = 
1,  2,  ...  ,k  —  1.  Similar  to  how  we  treat  binary  variables,  each  of  these  variables  is 
a  dummy  variable  and  consists  of  l’s  for  units  which  belong  to  the  category  of 
interest  and  O’ s  for  the  other  category 

1  for  category  i , 

0  otherwise. 


Xi  = 


(11.21) 


The  category  for  which  we  do  not  create  a  dummy  variable  is  called  the  reference 
category,  and  the  interpretation  of  the  parameters  of  the  dummy  variables  is  with 
respect  to  this  category.  For  example,  for  category  i ,  the  y-values  are  on  average  Pi 
higher  than  for  the  reference  category.  The  concept  of  creating  dummy  variables  and 
interpreting  them  is  explained  in  the  following  example. 


Example  11.6.3  Consider  again  the  pizza  data  set  described  in  Appendix  A.4.  The 
manager  may  hypothesize  that  the  delivery  times  vary  with  respect  to  the  branch. 
There  are  k  =  3  branches:  East,  West,  Centre.  Instead  of  using  the  variable  x  = 
branch,  we  create  (k  —  1),  i.e.  (3  —  1)  =  2  new  variables  denoting  xi  =  East  and 


X2  =  West.  We  set  xi  =  1  for  those  deliveries  which  come  from  the  branch  in  the 

East  and  set  x\  =  0  for  other  deliveries.  Similarly,  we  set  X2  = 

1  for  those  deliveries 

which  come  from  the  West  and  X2  =  0  for  other  deliveries.  The  data  then  is  as  follows: 

Delivery 

y  (Delivery  Time) 

x  (Branch) 

xi  (East) 

X2  (West) 

/  1 

35.1 

East 

1 

0  \ 

2 

25.2 

East 

1 

0 

3 

45.6 

West 

0 

1 

4 

29.4 

East 

1 

0 

5 

30.0 

West 

0 

1 

6 

40.3 

Centre 

0 

0 

\  1266 

35.7 

West 

0 

1  ) 

Deliveries  which  come  from  the  East  have  xi  = 

1  and  X2  = 

-  0,  deliveries  which 

come  from  the  West  have  xi  =  0  and  X2  =  1,  and  deliveries  from  the  Centre  have 
xi  =  0  and  X2  =  0.  The  regression  model  of  interest  is  thus 

y  =  ft  +  ft  East  +  ft  West  +  e. 

We  can  calculate  the  least  squares  estimate  ft  =  (ft,  ft,  ft)7  =  (X/X)_iX/y  via  R: 
either  (i)  we  create  the  dummy  variables  ourselves  or  (ii)  we  ask  R  to  create  it  for 
us.  This  requires  that  “branch”  is  a  factor  variable  (which  it  is  in  our  data,  but  if  it 
was  not,  then  we  would  have  to  define  it  in  the  model  formula  via  as  .  factor  () ). 

East  <-  as .numeric (branch== ' East ' )  JC^ 

West  <-  as .numeric (branch== ' West ' ) 

lm(time~  East+West)  #  option  1 

lm(time~  branch)  #  option  2a 

lm(time~  as . factor (branch) )  #  option  2b 
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We  obtain  the  following  results: 

Y  =  36.3  —  5.2East  —  l.lWest. 

The  interpretations  are  as  follows: 

•  The  average  delivery  time  for  the  branch  in  the  Centre  is  36.3  min,  the  delivery 
time  for  the  branch  in  the  East  is  36.3  —  5.2  =  3  E 1  min,  and  the  predicted  delivery 
time  for  the  West  is  36.3  —  El  =  35.2  min. 

•  Therefore,  deliveries  arrive  on  average  5.2  min  earlier  in  the  East  compared  with 

A 

the  centre  (/3\  =  —5.2),  and  deliveries  in  the  West  arrive  on  average  1 . 1  min  earlier 

/V 

than  in  the  centre  (/E  =  —1.1).  In  other  words,  it  takes  5.2  min  less  to  deliver  a 
pizza  in  the  East  than  in  the  Centre.  The  deliveries  in  the  West  take  on  average 
1.1  min  less  than  in  the  Centre. 

Consider  now  a  covariate  with  k  =  3  categories  for  which  we  create  two  new 
dummy  variables,  xi  and  X2.  The  linear  model  is  y  =  fio  +  P\X\  +  @2*2  +  e. 

•  For  xi  =  1,  we  obtain  E(y)  =  fio  +  Pi  •  1  +  Pi  •  0  =  =  E( y|xi  =  1, 

X2  =  0); 

•  For  X2  =  1,  we  obtain  E(y)  =  fio  +  Pi  •  0  +  P2  •  1  =  Po  +  Pi  =  E(y\x\  =  0, 
X2  =  1);  and 

•  For  the  reference  category  (xi  =  X2  =  0),  we  obtain  y  =  +  fa  •  Q  +  •  0  = 

Po  =  E(y|xi  =  x2  =  0). 

We  can  thus  conclude  that  the  intercept  Po  =  E(y|xi  =  0,  X2  =0)  describes  the 
average  y  for  the  reference  category,  that  fi\  =  E(y|xi  =  1,  *2  =  0)  —  E(y\x\  = 
0,  X2  =  0)  describes  the  difference  between  the  first  and  the  reference  category, 
and  that  P2  =  E(y|xi  =  0,  X2  =  1)  —  E(y|xi  =  0,  X2  =  0)  describes  the  difference 
between  the  second  and  the  reference  category. 

Remark  11.6.1  There  are  other  ways  of  recoding  categorical  variables,  such  as  effect 
coding.  However,  we  do  not  describe  them  in  this  book. 


1 1 .6.3  Transformations 

As  we  have  already  outlined  in  Sect.  1 1.5,  if  x  is  transformed  by  a  function  T (x)  = 
(T Oi),  T O2),  •  •  • ,  T (xn))  then 

y  =  f(x)  =  Po  +  PiT(x)  +  e  (11.22) 

is  still  a  linear  model  because  the  model  is  linear  in  its  parameters  ft.  Popular  trans¬ 
formations  T(x)  are  log(x),  exp(x),  xp,  among  others.  The  choice  of  such  a 
function  is  typically  motivated  by  the  application  and  data  at  hand.  Alternatively,  a 
relationship  between  y  and  x  can  be  modelled  via  a  polynomial  as  follows: 

y  =  /(x)  =  fa  +  fiix  +  fa*2  H - f  Ppxp  +  e. 


(11.23) 
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Fig.  11. 6  Scatter  plot  and 
regression  lines  for  Example 


11.6.4 


Example  11.6.4  Consider  again  Examples  11.2.1,  11.3.2,  11.3.3,  11.4.1,  and  11.5.1 
where  we  analysed  the  association  of  intensity  of  exercising  and  recovery  time  after 
knee  surgery.  The  linear  regression  line  was  estimated  as 

Recovery  Time  =  100.28  —  1.22  Exercising  Time. 

One  could  question  whether  the  association  is  indeed  linear  and  fit  the  second-  and 
third-order  polynomials: 

Recovery  Time  =  fio  —  /3\ Exercise  +  /^Exercise2, 

Recovery  Time  =  fio  —  /3\ Exercise  +  /^Exercise2  +  /^Exercise3. 

To  obtain  the  estimates  we  can  use  R  : 

lm(Y~X+I(X2))  f5 

lm(Y~X+I(X2)+I(X3)) 

The  results  are  fio  =  121.8,  /3\  =  —2.4,  fc  =  0.014  and  fio  =  12.9,  /3\  =  6.90, 
/3 2  =  —0.23,  ^63  =  0.002  for  the  two  models  respectively.  While  it  is  difficult  to 
interpret  these  coefficients  directly,  we  can  simply  plot  the  regression  lines  on  a 
scatter  plot  (see  Fig.  1 1.6). 

We  see  that  the  regression  based  on  the  quadratic  polynomial  visually  fits  the 
data  slightly  better  than  the  linear  polynomial.  It  seems  as  if  the  relation  between 
recovery  time  and  exercise  time  is  not  exactly  linear  and  the  association  is  better 
modelled  through  a  second-order  polynomial.  The  regression  line  based  on  the  cubic 
polynomial  seems  to  be  even  closer  to  the  measured  points;  however,  the  functional 
form  of  the  association  looks  questionable.  Driven  by  a  single  data  point,  we  obtain  a 
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regression  line  which  suggests  a  heavily  increased  recovery  time  for  exercising  times 
greater  than  65  min.  While  it  is  possible  that  too  much  exercising  causes  damage  to 
the  knee  and  delays  recovery,  the  data  does  not  seem  to  be  clear  enough  to  support 
the  shape  suggested  by  the  model.  This  example  illustrates  the  trade-off  between 
fit  (i.e.  how  good  the  regression  line  fits  the  data)  and  parsimony  (i.e.  how  well  a 
model  can  be  understood  if  it  becomes  more  complex).  Section  11.8  explores  this 
non-trivial  issue  in  more  detail. 


Transformations  of  the  Outcome  Variable.  It  is  also  possible  to  apply  transforma¬ 
tions  to  the  outcome  variable.  While  in  many  situations,  this  makes  interpretations 
quite  difficult,  a  log  transformation  is  quite  common  and  easy  to  interpret.  Consider 

the  log-linear  model 

i°g  y  —  A)1  +  Pixi  +  •  •  •  +  Ppx  p  +  e 

where  y  is  required  to  be  greater  than  0.  Exponentiating  on  both  sides  leads  to 

y  =  gfo1  .  . ePpx P  . 

It  can  be  easily  seen  that  a  one  unit  increase  in  xj  multiplies  y  by  e^i .  Therefore,  if 
y  is  log-transformed,  one  simply  interprets  exp(/?)  instead  of  /?.  For  example,  if  y 
is  the  yearly  income,  x  denotes  gender  (1  =  male,  0  =  female),  and  /3\  =0.2  then 
one  can  say  that  a  man’s  income  is  exp(0.2)  =  1.22  times  higher  (i.e.  22  %)  than  a 
woman’s. 


1 1 .7  The  Inductive  View  of  Linear  Regression 

So  far  we  have  introduced  linear  regression  as  a  method  which  describes  the  rela¬ 
tionship  between  dependent  and  independent  variables  through  the  best  fit  to  the 
data.  However,  as  we  have  highlighted  in  earlier  chapters,  often  we  are  interested 
not  only  in  describing  the  data  but  also  in  drawing  conclusions  from  a  sample  about 
the  population  of  interest.  For  instance,  in  an  earlier  example,  we  have  estimated  the 
association  of  branch  and  delivery  time  for  the  pizza  delivery  data.  The  linear  model 
was  estimated  as  delivery  time  =  36.3  —  5.2  East  —  1.1  West;  this  indicates  that  the 
delivery  time  is  on  average  5.2 min  shorter  for  the  branch  in  the  East  of  the  town 
compared  with  the  central  branch  and  1 . 1  min  shorter  for  the  branch  in  the  West  com¬ 
pared  with  the  central  branch.  When  considering  all  pizza  deliveries  (the  population), 
not  only  those  collected  by  us  over  the  period  of  a  month  (the  sample),  we  might  ask 
ourselves  whether  1.1  min  is  a  real  difference  valid  for  the  entire  population  or  just 
caused  by  random  error  involved  in  the  sample  we  chose.  In  particular,  we  would 
also  like  to  know  what  the  95  %  confidence  interval  for  our  parameter  estimate  is. 
Does  the  interval  cover  “zero”?  If  yes,  we  might  conclude  that  we  cannot  be  very 
sure  that  there  is  an  association  and  do  not  reject  the  null  hypothesis  that  /3west  =  0. 
In  reference  to  Chaps.  9  and  10,  we  now  apply  the  concepts  of  statistical  inference 
and  testing  in  the  context  of  regression  analysis. 
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Point  and  interval  estimation  in  the  linear  model.  We  now  rewrite  the  model  for 
the  purpose  of  introducing  statistical  inference  to  linear  regression  as 

y  —  Pol  +  0ixi  +  •  •  •  +  ftp*  p  +  e 
=  Xp  +  e  with  e  ~  N(0,  ct2I)  .  (11.24) 


where  y  is  the  n  x  1  vector  of  outcomes  and  X  is  the  n  x  (p  +  1)  design  matrix 
(including  a  column  of  l’s  for  the  intercept).  The  identity  matrix  I  consists  of  l’s  on 
the  diagonal  and  0’s  elsewhere,  and  the  parameter  vector  is  0  =  (0o,  pi, ... ,  pp)r . 
We  would  like  to  estimate  0  to  make  conclusions  about  a  relationship  in  the  pop¬ 
ulation.  Similarly,  e  reflects  the  random  errors  in  the  population.  Most  importantly, 
the  linear  model  now  contains  assumptions  about  the  errors.  They  are  assumed  to 
be  normally  distributed,  N(0,  a2 1),  which  means  that  the  expectation  of  the  errors 
is  0,  E (et)  =  0,  the  variance  is  Varfe)  =  a 2  (and  therefore  the  same  for  all  <?;),  and 
it  follows  from  Var(e)  =  cr2 1  that  Covfe ,  eP)  =  0  for  all  i  ^  i' .  The  assumption  of  a 
normal  distribution  is  required  to  construct  confidence  intervals  and  test  of  hypothe¬ 
ses  statistics.  More  details  about  these  assumptions  and  the  procedures  to  test  their 
validity  on  the  basis  of  a  given  sample  of  data  are  explained  in  Sect.  11.9. 

The  least  squares  estimate  of  0  is  obtained  by 

p  =  (X'Xr'X'y.  (11.25) 

a.  /v 

It  can  be  shown  that  0  follow  a  normal  distribution  with  mean  E(0)  =  0  and  covari- 

A  r\ 

ance  matrix  Var(0)  =  crz  I  as 

P  ~  N(P,a2(X'X)~l).  (11.26) 

A.  /V 

Note  that  0  is  unbiased  (since  E(0)  =  0);  more  details  about  (11.26)  can  be  found 
in  Appendix  C.6.  An  unbiased  estimator  of  a2  is 


2^  (y  —  X0)r(y  —  X0)  =  e;e 

n  —  (p  +  1)  n  —  (p  +  1) 


1 

n  -  (p  +  1) 


(11.27) 


The  errors  are  estimated  from  the  data  as  e  =  y  —  X0  and  are  called  residuals. 

Before  giving  a  detailed  data  example,  we  would  like  to  outline  how  to  draw 
conclusions  about  the  population  of  interest  from  the  linear  model.  As  we  have  seen, 
both  0  and  a  2  are  unknown  in  the  model  and  are  estimated  from  the  data  using  ( 1 1 .25) 
and  (11.27).  These  are  our  point  estimates.  We  note  that  if  0 j  =  0,  then  PjX j  =  0, 
and  then,  the  model  will  not  contain  the  term  0yxj.  This  means  that  the  covariate  xj 
does  not  contribute  to  explaining  the  variation  in  y.  Testing  the  hypothesis  pj  =  0 
is  therefore  equivalent  to  testing  whether  xj  is  associated  with  y  or  not  in  the  sense 
that  it  helps  to  explain  the  variations  in  y  or  not.  To  test  whether  the  point  estimate  is 
different  from  zero,  or  whether  the  deviations  of  estimates  from  zero  can  be  explained 
by  random  variation,  we  have  the  following  options: 
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1.  The  100(1  —  a) %  confidence  interval  for  each  /3 j  is 


yv 


(11.28) 


with  <jg  =  yj SjjG2  where  sjj  is  the  j th  diagonal  element  of  the  matrix  (XrX)  1 . 

If  the  confidence  interval  does  not  overlap  0,  then  we  can  conclude  that  /3  is 
different  from  0  and  therefore  Xi  is  associated  with  y  and  it  is  a  relevant  variable.  If 
the  confidence  interval  includes  0,  we  cannot  conclude  that  there  is  an  association 
between  x\  and  y. 

A 

2.  We  can  formulate  a  formal  test  which  tests  if  fij  is  different  from  0.  The  hypothe¬ 
ses  are: 


Hq  :  Pj  =  0  versus  H\  :  /=  0. 

The  test  statistic 


follows  a  tn-p- 1  distribution  under  Ho.  If  |  T  \  >  tn-p- 1;  i-a/2»  we  can  reject  the 
null  hypothesis  (at  a  level  of  significance);  otherwise,  we  accept  it. 

3.  The  decisions  we  get  from  the  confidence  interval  and  the  T -statistic  from  points 
1  and  2  are  identical  to  those  obtained  by  checking  whether  the  p -value  (see  also 
Sect.  10.2.6)  is  smaller  than  a ,  in  which  case  we  also  reject  the  null  hypothesis. 


Example  11.7.1  Recall  Examples  11.2.1,  11.3.2,  11.3.3,  11.4.1,  11.5.1,  and  11.6.4 
where  we  analysed  the  association  of  exercising  time  and  time  of  recovery  from  knee 
surgery.  We  can  use  R  to  obtain  a  full  inductive  summary  of  the  linear  model: 

summary  (lm(Y^X) ) 


Coefficients : 

Estimate  Std.  Error  t  value  Pr (> I t | ) 
(Intercept)  100.1244  10.3571  9.667  2.17e-06  *** 

X  -1.2153  0.2768  -4.391  0.00135  ** 


Signif.  codes:  0  ***  0.001  **  0 . 01  *  0 . 05  .  0 . 1  1 

Residual  standard  error:  11.58  on  10  degrees  of  freedom 

Now,  we  explain  the  meaning  and  interpretation  of  different  terms  involved  in  the 
output. 

•  Under  “Estimate”,  the  parameter  estimates  are  listed  and  we  read  that  the  linear 
model  is  fitted  as  y  (recovery  time)  =  100.1244  —  1.2153  •  x  (exercise  time).  The 
subtle  differences  to  Example  11.2.1  are  due  to  rounding  of  numerical  values. 

•  The  variance  is  estimated  as  a 2  =  11. 582  (“Residual  standard  error”).  This  is 
easily  calculated  manually  using  the  residual  sum  of  squares:  ^  ej/(n  —  p  —  1)  = 
1/10  •  {(90  -  70.84)2  +  . . .  +  (45  -  62.30)2}  =  11.582. 
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•  The  standard  errors  az  are  listed  under  “Std.  Error”.  Given  that  n  —  p  —  1  = 
12  —  1  —  1  =  10,  and  therefore  /jo;o.975  =  2.28,  we  can  construct  a  confidence 
interval  for  age: 

-1.22  db  2.28  •  0.28  =  [-1.86;  -0.58] 


The  interval  does  not  include  0,  and  we  therefore  conclude  that  there  is  an  asso¬ 
ciation  between  exercising  time  and  recovery  time.  The  random  error  involved 

/V 

in  the  data  is  not  sufficient  to  explain  the  deviation  of  fa  =  —1.22  from  0  (given 
a  =  0.05). 


We  therefore  reject  the  null  hypothesis  that  ft j  =0.  This  can  also  be  seen  by  com¬ 
paring  the  test  statistic  (listed  under  “t  value”  and  obtained  by  (0j  —  0)/  /a?  = 


—  1.22/0.277)  with  /jo,o.975>  I  —  4.39 1  >  2.28.  Moreover,  p  =  0.001355  <  a  = 
0.05.  We  can  say  that  there  is  a  significant  association  between  exercising  and 
recovery  time. 


Sometimes,  one  is  interested  in  whether  a  regression  model  is  useful  in  the  sense 
that  all  fa ’s  are  different  from  zero,  and  we  therefore  can  conclude  that  there  is  an 
association  between  any  x;  and  y.  The  null  hypothesis  is 

H0:  fa  =  Pi  =  •••  =  PP  =  0 
and  can  be  tested  by  the  overall  F-test 

F  =  (y  -  y)'(y  -  y  )/(p)  =  n-  P-  \  YJJn  -  y )2 

(y  -  y)'(y  -  y)/(«  -  p  -  i)  p  Z,-^2 

The  null  hypothesis  is  rejected  if  F  >  Fi-a-p^n-p-i .  Note  that  the  null  hypothesis  in 
this  case  tests  only  the  equality  of  slope  parameters  and  does  not  include  the  intercept 
term. 


Example  11.7.2  In  this  chapter,  we  have  already  explored  the  associations  between 
branch,  operator,  and  bill  with  delivery  time  for  the  pizza  data  (Appendix  A.4).  If  we 
fit  a  multiple  linear  model  including  all  of  the  three  variables,  we  obtain  the  following 
results: 

summary  (lm(time^branch+bill+operator) ) 


Coefficients : 


Estimate 

Std.  Error 

t  value 

PrOltl) 

(Intercept) 

26.19138 

0.78752 

33.258 

<  2e-16 

*** 

branchEast 

-3.03606 

0.42330 

-7.172 

1 . 25e-12 

*** 

branchWest 

-0.50339 

0.38907 

-1.294 

0.196 

bill 

0.21319 

0.01535 

13.885 

<  2e-16 

*** 

operatorMelissa 

0.15973 

0.31784 

0.503 

0.615 

Signif .  codes: 

0  ***  0.1 

001  **  0.01 

*  0.05  . 

0.1  1 

Residual  standard  error:  5.653  on  1261  degrees  of  freedom 
Multiple  R-squared:  0.2369,  Adjusted  R-squared:  0.2345 
F-statistic:  97.87  on  4  and  1261  DF,  p-value :  <  2.2e-16 
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By  looking  at  the  ^-values  in  the  last  column,  one  can  easily  see  (without  calculat¬ 
ing  the  confidence  intervals  or  evaluating  the  t- statistic)  that  there  is  a  significant  asso¬ 
ciation  between  the  bill  and  delivery  time;  also,  it  is  evident  that  the  average  delivery 
time  in  the  branch  in  the  East  is  significantly  different  (~  3  min  less)  from  the  central 
branch  (which  is  the  reference  category  here).  However,  the  estimated  difference  in 
delivery  times  for  both  the  branches  in  the  West  and  the  operator  was  not  found  to  be 
significantly  different  from  zero.  We  conclude  that  some  variables  in  the  model  are 
associated  with  delivery  time,  while  for  others,  we  could  not  show  the  existence  of 
such  an  association.  The  last  line  of  the  output  confirms  this:  the  overall  F-test  has  a 
test  statistic  of  97. 87  which  is  larger  than  F\-a-p^n-p-i  =  Fo.95;4,i26i  =  2.37;there- 
fore,  the  p-\ alue  is  smaller  0.05  (2.2  x  10“ 16)  and  the  null  hypothesis  is  rejected.  The 
test  suggests  that  there  is  at  least  one  variable  which  is  associated  with  delivery  time. 


1 1 .7.1  Properties  of  Least  Squares  and  Maximum  Likelihood 
Estimators 

The  least  squares  estimator  of  p  has  several  properties: 


1.  The  least  squares  estimator  of  ft  is  unbiased ,  E(P)  =  ft  (see  Appendix  C.6  for 
the  proof). 

2.  The  estimator  d2  as  introduced  in  equation  (1 1.27)  is  also  unbiased ,  i.e.  F(<r2)  = 
a2. 

3.  The  least  squares  estimator  of  ft  is  consistent ,  i.e.  p  converges  to  p  as  n  approaches 
infinity. 

4.  The  least  squares  estimator  of  P  is  asymptotically  normally  distributed . 

5.  The  least  squares  estimator  of  p  has  the  smallest  variance  among  all  linear  and 
unbiased  estimators  ( best  linear  unbiased  estimator)  of  P . 

We  do  not  discuss  the  technical  details  of  these  properties  in  detail.  It  is  more 
important  to  know  that  the  least  squares  estimator  has  good  features  and  that  is 
why  we  choose  it  for  fitting  the  model.  Since  we  use  a  “good”  estimator,  we  expect 
that  the  model  will  also  be  “good”.  One  may  ask  whether  it  is  possible  to  use  a 
different  estimator.  We  have  already  made  distributional  assumptions  in  the  model: 
we  require  the  errors  to  be  normally  distributed,  given  that  it  is  indeed  possible  to 
apply  the  maximum  likelihood  principle  to  obtain  estimates  for  P  and  o2  in  our 
set-up. 

Theorem  11.7.1  For  the  linear  model  (11.24),  the  least  squares  estimator  and  the 
maximum  likelihood  estimator  for  P  are  identical  However,  the  maximum  likelihood 
estimator  of  a2  is  =  1  /n(e'e)  of  a2  which  is  a  biased  estimator  of  a2,  but  it  is 
asymptotically  unbiased. 

How  to  obtain  the  maximum  likelihood  estimator  for  the  linear  model  is  presented  in 
Appendix  C.6.  The  important  message  of  Theorem  11.7.1  is  that  no  matter  whether 
we  apply  the  least  squares  principle  or  the  maximum  likelihood  principle,  we  always 
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A  j 

obtain  /}  =  (X'Xr'X'v  ;  this  does  not  apply  when  estimating  the  variance,  but  it  is 
an  obvious  choice  to  go  for  the  unbiased  estimator  (1 1.27)  in  any  given  analysis. 

11.7.2  The  ANOVA  Table 

A  table  that  is  frequently  shown  by  software  packages  when  performing  regression 
analysis  is  the  analysis  of  variance  (ANOVA)  table.  This  table  can  have  several  mean¬ 
ings  and  interpretations  and  may  look  slightly  different  depending  on  the  context  in 
which  it  is  used.  We  focus  here  on  its  interpretation  i)  as  a  way  to  summarize  the 
effect  of  categorical  variables  and  ii)  as  a  hypothesis  test  to  compare  k  means.  This 
is  illustrated  in  the  following  example. 

Example  11.7.3  Recall  Example  11.6.3  where  we  established  the  relationship 
between  branch  and  delivery  time  as 

Delivery  Time  =  36.3  —  5.2East  —  1.1  West. 

The  R  output  for  the  respective  linear  model  is  as  follows: 


Estimate 

Std.  Error 

t  value 

PrOltl) 

(Intercept) 

36.3127 

0.2957 

122.819 

<  2e-16 

*** 

branchEast 

-5.2461 

0.4209 

-12.463 

<  2e-16 

*** 

branchWest 

-1.1182 

0.4148 

-2.696 

0.00711 

** 

We  see  that  the  average  delivery  time  of  the  branches  in  the  East  and  the  Centre 
(reference  category)  is  different  and  that  the  average  delivery  time  of  the  branches  in 
the  West  and  the  Centre  is  different  (because  p  <  0.05).  This  is  useful  information, 
but  it  does  not  answer  the  question  if  branch  as  a  whole  influences  the  delivery  time. 
It  seems  that  this  is  the  case,  but  the  hypothesis  we  may  have  in  mind  may  be 

Hq  :  /XEast  =  M West  =  /^Centre 

which  corresponds  to 

Ho  /feast  =  /^West  —  /feentre 

in  the  context  of  the  regression  model.  These  are  two  identical  hypotheses  because 
in  the  regression  set-up,  we  are  essentially  comparing  three  conditional  means 
E(Y\X  =  vi)  =  E(Y\X  =  X2)  =  E(Y\X  =  V3).  The  ANOVA  table  summarizes  the 
corresponding  E-Test  which  tests  this  hypothesis: 

ml  <-  lm(time~branch)  O 

1 1 

anova(ml) 

Response:  time 

Df  Sum  Sq  Mean  Sq  F  value  Pr(>F) 

branch  2  6334  3166.8  86.05  <  2.2e-16  *** 

Residuals  1263  46481  36.8 
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We  see  that  the  null  hypothesis  of  3  equal  means  is  rejected  because  p  is  close  to 
zero. 

What  does  this  table  mean  more  generally?  If  we  deal  with  linear  regression  with 
one  (possibly  categorical)  covariate,  the  table  will  be  as  follows: 


df 

Sum  of  squares 

Mean  squares 

F-  statistic 

Var 

P 

SQReg. 

MSR=SQReg  Ip 

MSR/MSE 

Res 

n  —  p  —  1 

SQError 

MSE=  SQError Kfl  -  p  -  V) 

The  table  summarizes  the  sum  of  squares  regression  and  residuals  (see  Sect.  1 1.3  for 
the  definition),  standardizes  them  by  using  the  appropriate  degrees  of  freedom  (df), 
and  uses  the  corresponding  ratio  as  the  F- statistic.  Note  that  in  the  above  example, 
this  corresponds  to  the  overall  F- test  introduced  earlier.  The  overall  F- test  tests  the 
hypothesis  that  any  /3j  is  different  from  zero  which  is  identical  to  the  hypothesis 
above.  Thus,  if  we  fit  a  linear  regression  model  with  one  variable,  the  ANOVA  table 
will  yield  the  same  conclusions  as  the  overall  F- test  which  we  obtain  through  the 
main  summary.  However,  if  we  consider  a  multiple  linear  regression  model,  the 
ANOVA  table  may  give  us  more  information. 

Example  11.7.4  Suppose  we  are  not  only  interested  in  the  branch,  but  also  in  how 
the  pizza  delivery  times  are  affected  by  operator  and  driver.  We  may  for  example 
hypothesize  that  the  delivery  time  is  identical  for  all  drivers  given  they  deliver  for 
the  same  branch  and  speak  to  the  same  operator.  In  a  standard  regression  output,  we 
would  get  4  coefficients  for  5  drivers  which  would  help  us  to  compare  the  average 
delivery  time  of  each  driver  to  the  reference  driver;  it  would  however  not  tell  us  if 
on  an  average,  they  all  have  the  same  delivery  time.  The  overall  F- test  would  not 
help  us  either  because  it  would  test  if  any  /3j  is  different  from  zero  which  includes 
coefficients  from  branch  and  operator,  not  only  driver.  Using  the  anova  command 
yields  the  results  of  the  F- test  of  interest: 

m2  <-  lm(time~branch+operator+driver) 

anova (m2) 


Response:  time 


Df 

Sum  Sq 

Mean  Sq 

F  value 

Pr (>F) 

branch 

2 

6334 

3166.8 

88.6374 

<  2 . 2e-16  *** 

operator 

1 

16 

16.3 

0.4566 

0.4994 

driver 

4 

1519 

379.7 

10.6282 

1 . 798e-08  *** 

Residuals 

1258 

44946 

35.7 

We  see  that  the  null  hypothesis  of  equal  delivery  times  of  the  drivers  is  rejected. 
We  can  also  test  other  hypotheses  with  this  output:  for  instance,  the  null  hypothesis 
of  equal  delivery  times  for  each  operator  is  not  rejected  because  p  &  0.5. 
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11.7.3  Interactions 

It  may  be  possible  that  the  joint  effect  of  some  covariates  affects  the  response. 
For  example,  drug  concentrations  may  have  a  different  effect  on  men,  woman,  and 
children;  a  fertilizer  could  work  differently  in  different  geographical  areas;  or  a  new 
education  programme  may  show  benefit  only  with  certain  teachers.  It  is  fairly  simple 
to  target  such  questions  in  linear  regression  analysis  by  using  interactions.  Interac¬ 
tions  are  measured  as  the  product  of  two  (or  more)  variables.  If  either  one  or  both 
variables  are  categorical,  then  one  simply  builds  products  for  each  dummy  variable, 
thus  creating  (k  —  1)  x  (/  —  1)  new  variables  when  dealing  with  two  categorical 
variables  (with  k  and  /  categories  respectively).  These  product  terms  are  called  inter¬ 
actions  and  estimate  how  an  association  of  one  variable  differs  with  respect  to  the 
values  of  the  other  variable.  Now,  we  give  examples  for  continuous-categorical, 
categorical-categorical,  and  continuous-continuous  interactions  for  two  variables 
x\  and  X2. 

Categorical-Continuous  Interaction.  Suppose  one  variable  x\  is  categorical  with 
k  categories,  and  the  other  variable  X2  is  continuous.  Then,  k  —  1  new  variables 
have  to  be  created,  each  consisting  of  the  product  of  the  continuous  variable  and  a 
dummy  variable,  X2  x  xi . ,  i  e  1 , . . . ,  k  —  1 .  These  variables  are  added  to  the  regres¬ 
sion  model  in  addition  to  the  main  effects  relating  to  xi  and  X2  as  follows: 

y  —  Po  +  Pi^ii  +  •  •  •  +  Pk-ixik-!  +  Pkx2 

+Pk+\xi{*2  H - 1-  PpX ik_iX2  +  e. 

It  follows  that  for  the  reference  category  of  xi ,  the  effect  of  X2  is  just  Pk  (because 
each  X2X1 .  is  zero  since  each  xi .  is  zero).  However,  the  effect  for  all  other  categories 
is  P2  +  Pj  where  Pj  refers  to  xijX2.  Therefore,  the  association  between  X2  and  the 
outcome  y  differs  by  Pj  between  category  j  and  the  reference  category.  Testing 
Ho  :  pj  =  0  thus  helps  to  identify  whether  there  is  an  interaction  effect  with  respect 
to  category  j  or  not. 

Example  11.7.5  Consider  again  the  pizza  data  described  in  Appendix  A.4.  We  may 
be  interested  in  whether  the  association  of  delivery  time  and  temperature  varies 
with  respect  to  branch.  In  addition  to  time  and  branch,  we  therefore  need  additional 
interaction  variables.  Since  there  are  3  branches,  we  need  3  —  1  =  2  interaction 
variables  which  are  essentially  the  product  of  (i)  time  and  branch  “East”  and  (ii) 
time  and  branch  “West”.  This  can  be  achieved  in  R  by  using  either  the  “★”  operator 
(which  will  create  both  the  main  and  interaction  effects)  or  the  “:”  operator  (which 
only  creates  the  interaction  term). 

int. model.  1  <-  lm(temperature~time*branch) 

int  .model .  2  <-  lm(temperature~time+branch+time  : branch) 

summary ( int . mode 1 . 1 ) 

summary ( int . mode 1 . 2 ) 
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Coefficients : 


(Intercept)  70 
time  -0 
branchEast  10 
branchWest  1 


time : branchEast  -0 
time : branchWest  0 


timate 

Std.  Error 

718327 

1.850918 

288011 

0.050342 

941411 

2.320682 

102597 

2.566087 

195885 

0.066897 

004352 

0.070844 

t  value  Pr (> | t | ) 
38.207  <  2e-16  *** 

-5.721  1 . 32e-08  *** 
4.715  2.69e-06  *** 
0.430  0.66750 

-2.928  0.00347  ** 

0.061  0.95103 


The  main  effects  of  the  model  tell  us  that  the  temperature  is  almost  1 1  degrees 
higher  for  the  eastern  branch  compared  to  the  central  branch  (reference)  and  about 
1  degree  higher  for  the  western  branch.  However,  only  the  former  difference  is 
significantly  different  from  0  (since  the  p- value  is  smaller  than  a  =  0.05).  Moreover, 
the  longer  the  delivery  time,  the  lower  the  temperature  (0.29  degrees  for  each  minute). 
The  parameter  estimates  related  to  the  interaction  imply  that  this  association  differs 
by  branch:  the  estimate  is  indeed  /3time  =  —0.29  for  the  reference  branch  in  the 
Centre,  but  the  estimate  for  the  branch  in  the  East  is —0.29  —  0.196  =  —0.486  and  the 
estimate  for  the  branch  in  the  West  is  —0.29  +  0.004  =  —0.294.  However,  the  latter 
difference  in  the  association  of  time  and  temperature  is  not  significantly  different 
from  zero.  We  therefore  conclude  that  the  delivery  time  and  pizza  temperature  are 
negatively  associated  and  this  is  more  strongly  pronounced  in  the  eastern  branch 
compared  to  the  other  two  branches.  It  is  also  possible  to  visualize  this  by  means  of 
a  separate  regression  line  for  each  branch,  see  Fig.  1 1.7. 


Fig.  11.7  Interaction  of 
delivery  time  and  branch  in 
Example  11.7.5 
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•  Centre:  Temperature  =  70.7  —  0.29  x  time, 

•  East:  Temperature  =  70.7  +  10.9  —  (0.29  +  0.195)  x  time, 

•  West:  Temperature  =  70.7  +  1.1  —  (0.29  —  0.004)  x  time. 

One  can  see  that  the  pizzas  delivered  by  the  branch  in  the  East  are  overall  hotter 
but  longer  delivery  times  level  that  benefit  off.  One  might  speculate  that  the  delivery 
boxes  from  the  eastern  branch  are  not  properly  closed  and  therefore — despite  the 
overall  good  performance — the  temperature  falls  more  rapidly  over  time  for  this 
branch. 

Categorical-Categorical  Interaction.  For  two  categorical  variables  xi  and  X2,  with 
k  and  /  categories,  respectively,  (k  —  1)  x  (/  —  1)  new  dummy  variables  Xu  x  x^j 
need  to  be  created  as  follows: 

y  —  Po  +  frxij  +  •  •  •  +  ^-lxlk_i  +  Pk*2i  +  •  •  •  +  Pk+l- 2X21_1 

+A+/-1x1ix2i  +  •  •  •  +  PpX  ik_!x2i_i  +  e. 

The  interpretation  of  the  regression  parameters  of  interest  is  less  complicated  than 
it  looks  at  first.  If  the  interest  is  in  xi,  then  the  estimate  referring  to  the  category  of 
interest  (i.e.  x^.)  is  interpreted  as  usual — with  the  difference  that  it  relates  only  to 
the  reference  category  of  X2.  The  effect  of  xif  may  vary  with  respect  to  X2,  and  the 
sum  of  the  respective  main  and  interaction  effects  for  each  category  of  X2  yields 
the  respective  estimates.  These  considerations  are  explained  in  more  detail  in  the 
following  example. 


Example  11.7.6  Consider  again  the  pizza  data.  If  we  have  the  hypothesis  that  the 
delivery  time  depends  on  the  operator  (who  receives  the  phone  calls),  but  the  effect  is 
different  for  different  branches,  then  a  regression  model  with  branch  (3  categories,  2 
dummy  variables),  operator  (2  categories,  one  dummy  variable),  and  their  interaction 
(2  new  variables)  can  be  used. 


lm(time^operator*branch) 


Coefficients : 

Estimate 

Std.  Error 

t  value 

PrOltl) 

(Intercept) 

36.4203 

0.4159 

87.567 

<2e-16 

*** 

operatorMelissa 

-0.2178 

0.5917 

-0.368 

0.7129 

branchEast 

-5.6685 

0.5910 

-9.591 

<2e-16 

*** 

branchWest 

-1.3599 

0.5861 

-2.320 

0.0205 

* 

operatorMelissa: branchEast 

0.8599 

0.8425 

1.021 

0.3076 

operatorMelissa : branchWest 

0 . 4842 

0.8300 

0.583 

0.5598 
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The  interaction  terms  can  be  interpreted  as  follows: 

•  If  we  are  interested  in  the  operator,  we  see  that  the  delivery  time  is  on  aver¬ 
age  0.21  min  shorter  for  operator  “Melissa”.  When  this  operator  deals  with  a 
branch  other  than  the  reference  (Centre),  the  estimate  changes  to  —0.21  +  0.86  = 
0.64  min  longer  delivery  in  the  case  of  branch  “East”  and  —0.21  +  0.48  = 
0.26  min  for  branch  “West”. 

•  If  we  are  interested  in  the  branches,  we  observe  that  the  delivery  time  is  shortest 
for  the  eastern  branch  which  has  on  average  a  5.66 min  shorter  delivery  time 
than  the  central  branch.  However,  this  is  the  estimate  for  the  reference  operator 
only;  if  operator  “Melissa”  is  in  charge,  then  the  difference  in  delivery  times 
for  the  two  branches  is  only  —5.66  +  0.86  =  —4.8  min.  The  same  applies  when 
comparing  the  western  branch  with  the  central  branch:  depending  on  the  operator, 
the  difference  between  these  two  branches  is  estimated  as  either  — 1.36  or— 1.36  + 
0.48  =  0.88  min,  respectively. 

•  The  interaction  terms  are  not  significantly  different  from  zero.  We  therefore  con¬ 
clude  that  the  hypothesis  of  different  delivery  times  for  the  two  operators,  possibly 
varying  by  branch,  could  not  be  confirmed. 

Continuous-Continuous  Interaction.  It  is  also  possible  to  add  an  interaction  of 
two  continuous  variables.  This  is  done  by  adding  the  product  of  these  two  variables 
to  the  model.  If  xi  and  X2  are  two  continuous  variables,  then  x  1X2  is  the  new  variable 
added  in  the  model  as  an  interaction  effect  as 

y  =  Pixi  +  /32X2  +  P3X1X2  +  e. 


Example  11.7.7  If  we  again  consider  the  pizza  data,  with  pizza  temperature  as  an 
outcome,  we  may  wish  to  test  for  an  interaction  of  bill  and  time. 


Coefficients : 


(Intercept) 

bill 

time 

bill : time 


Estimate  Std.  Error  t  value  Pr ( > I t I ) 


92.555943 

-0.454381 

-0.679537 

0.008687 


2.747414  33.688  <  2e-16  *** 
0.068322  -6.651  4.34e-ll  *** 
0.086081  -7.894  6.31e-15  *** 
0.002023  4.294  1.89e-05  *** 


The  R  output  above  reveals  that  there  is  a  significant  interaction  effect.  The  inter¬ 
pretation  is  more  difficult  here.  It  is  clear  that  a  longer  delivery  time  and  a  larger  bill 
decrease  the  pizza’s  temperature.  However,  for  a  large  product  of  bill  and  time  (i.e., 
when  both  are  large),  these  associations  are  less  pronounced  because  the  negative 
coefficients  become  more  and  more  outweighed  by  the  positive  interaction  term.  On 
the  contrary,  for  a  small  bill  and  short  delivery  time,  the  temperature  can  be  expected 
to  be  quite  high. 

Combining  Estimates.  As  we  have  seen  in  the  examples  in  this  section,  it  can 
make  sense  to  combine  regression  estimates  when  interpreting  interaction  effects. 
While  it  is  simple  to  obtain  the  point  estimates,  it  is  certainly  also  important  to  report 
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their  95  %  confidence  intervals.  As  we  know  from  Theorem  7.7.1,  the  variance  of 
the  combination  of  two  random  variables  can  be  obtained  by  Var(X  db  Y)  =  o\Y  — 
Var(X)  +  Var(y)  ±  2Cov(X,  Y).  We  can  therefore  estimate  a  confidence  interval 

/V  /V 

for  the  combined  estimate  Pi  +  Pj  as 

(Pi  +  Pj)  ±  tn-p- V,l-a/2  •  Vyji+fjj)  (11.29) 

where  cr^+^  is  obtained  from  the  estimated  covariance  matrix  Cov(/?)  via 

CT(/§.+j0.)  =  +  Vtt(Pj)  +  2Co  v(Pi,Pj). 

Example  11.7.8  Recall  Example  1 1.7.5  where  we  estimated  the  association  between 
pizza  temperature,  delivery  time,  and  branch.  There  was  a  significant  interaction 
effect  for  time  and  branch.  Using  R ,  we  obtain  the  covariance  matrix  as  follows: 


mymodel  <-  lm(temperature~time*branch) 
vcov(mymodel) 


(Int . ) 

time 

East 

West 

time : East 

time : West 

(Int . ) 

3.4258 

-0.09202 

-3.4258 

-3.4258 

0.09202 

0.09202 

time 

-0.0920 

0.00253 

0.0920 

0.0920 

-0.00253 

-0.00253 

branchEast 

-3.4258 

0.09202 

5.3855 

3.4258 

-0.15232 

-0.09202 

branchWest 

-3.4258 

0.09202 

3.4258 

6 . 5848 

-0.09202 

-0.17946 

time : East 

0.0920 

-0.00253 

-0.1523 

-0.0920 

0 . 00447 

0.00253 

time : West 

0.0920 

-0.00253 

-0.0920 

-0.1794 

0.00253 

0.00501 

The  point  estimate  for  the  association  between  time  and  temperature  in  the  eastern 
branch  is  —0.29  —  0.19  (see  Example  11.7.5).  The  standard  error  is  calculated  as 
VO. 00253  +  0.00447  —  2  •  0.00253  =  0.044.  The  confidence  interval  is  therefore 
-0.48  =b  1.96  •  0.044  =  [-0.56;  -0.39]. 

Remark  11.7.1  If  there  is  more  than  one  interaction  term,  then  it  is  generally  possible 
to  test  whether  there  is  an  overall  interaction  effect,  such  as  p\  =  P2  =  P3  =  Pa  =  0 
in  the  case  of  four  interaction  variables.  These  tests  belong  to  the  general  class  of 
“linear  hypotheses”,  and  they  are  not  explained  in  this  book.  It  is  also  possible  to 
create  interactions  between  more  than  two  variables.  However,  the  interpretation 
then  becomes  difficult. 


1 1 .8  Comparing  Different  Models 


There  are  many  situations  where  different  multiple  linear  models  can  be  fitted  to  a 
given  data  set,  but  it  is  unclear  which  is  the  best  one.  This  relates  to  the  question  of 
which  variables  should  be  included  in  a  model  and  which  ones  can  be  removed. 
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For  example,  when  we  modelled  the  association  of  recovery  time  and  exercising 
time  by  using  different  transformations,  it  remained  unclear  which  of  the  proposed 
transformations  was  the  best.  In  the  pizza  delivery  example,  we  have  about  ten 
covariates:  Does  it  make  sense  to  include  all  of  them  in  the  model  or  do  we  understand 
the  associations  in  the  data  better  by  restricting  ourselves  to  the  few  most  important 
variables?  Is  it  necessary  to  include  interactions  or  not? 

There  are  various  model  selection  criteria  to  compare  the  quality  of  different 
fitted  models.  They  can  be  useful  in  many  situations,  but  there  are  also  a  couple  of 
limitations  which  make  model  selection  a  difficult  and  challenging  task. 


Coefficient  of  Determination  ( R 2).  A  possible  criterion  to  check  the  goodness  of 
fit  of  a  model  is  the  coefficient  of  determination  ( R 2).  We  discussed  its  development, 
interpretation,  and  philosophy  in  Sect.  11.3  (see  p.  256  for  more  details). 


'I  Q 

Adjusted  R  .  Although  R  is  a  reasonable  measure  for  the  goodness  of  fit  of  a 
model,  it  also  has  some  limitations.  One  limitation  is  that  if  the  number  of  covariates 
increases,  R 2  increases  too  (we  omit  the  theorem  and  the  proof  for  this  statement). 
The  added  variables  may  not  be  relevant,  but  R2  will  still  increase,  wrongly  indicating 
an  improvement  in  the  fitted  model.  This  criterion  is  therefore  not  a  good  measure 
for  model  selection.  An  adjusted  version  of  R2  is  defined  as 


SQEnor/(ft  P  1) 

S 2 Total/ (n  ~  1) 


(11.30) 


It  can  be  used  to  compare  the  goodness  of  fit  of  models  with  a  different  number 
of  covariates.  Please  note  that  both  R 2  and  R2dj  are  only  defined  when  there  is  an 
intercept  term  in  the  model  (which  we  assume  throughout  the  chapter). 


Example  11.8.1  In  Fig.  11.6,  the  association  of  exercising  time  and  recovery  time 
after  knee  surgery  is  modelled  linearly,  quadratically,  and  cubically;  in  Fig.  11.5, 
this  association  is  modelled  by  means  of  a  square-root  transformation.  The  model 
summary  in  R  returns  both  R2  (under  ‘‘Multiple  R-squared”)  and  the  adjusted  R2 
(under  “adjusted  R-squared”).  The  results  are  as  follows: 


Linear  0.6584  0.6243 
Quadratic  0.6787  0.6074 
Cubic  0.7151  0.6083 
Square  root  0.6694  0.6363 


It  can  be  seen  that  R2  is  larger  for  the  models  with  more  variables;  i.e.  the  cubic 
model  (which  includes  three  variables)  has  the  largest  R2.  The  adjusted  R 2,  which 
takes  the  different  model  sizes  into  account,  favours  the  model  with  the  square-root 
transformation.  This  model  provides  therefore  the  best  fit  to  the  data  among  the  four 
models  considered  using  R2. 
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Akaike’s  Information  Criterion  (AIC)  is  another  criterion  for  model  selection.  The 
AIC  is  based  on  likelihood  methodology  and  has  its  origins  in  information  theory. 
We  do  not  give  a  detailed  motivation  of  this  criterion  here.  It  can  be  written  for  the 
linear  model  as 

AIC  =  n  log(^Em)r)  +  2 (p  +  1).  (11.31) 

The  smaller  the  AIC,  the  better  the  model.  The  AIC  takes  not  only  the  fit  to  the  data 
via  S  0 Error  into  account  but  also  the  parsimony  of  the  model  via  the  term  2(/?  +  1). 
It  is  in  this  sense  a  more  mature  criterion  than  R2d-  which  considers  only  the  fit 
to  the  data  via  S^Error-  Akaike’s  Information  Criterion  can  be  calculated  in  R  via 
the  extract  AIC  ()  command.  There  are  also  other  commands,  but  the  results  differ 
slightly  because  the  different  formulae  use  different  constant  terms.  However,  no 
matter  what  formula  is  used,  the  results  regarding  the  model  choice  are  always  the 
same. 

Example  11.8.2  Consider  again  Example  11.8.1.  R2d-  preferred  the  model  which 
includes  exercising  time  via  a  square-root  transformation  over  the  other  options.  The 
AIC  value  for  the  model  where  exercise  time  is  modelled  linearly  is  60.59,  when 
modelled  quadratically  61.84,  when  modelled  cubically  62.4,  and  60.19  for  a  square- 
root  transformation.  Thus,  in  line  with  R2d- ,  the  AIC  also  prefers  the  square-root 
transformation. 


Backward  selection.  Two  models,  which  differ  only  by  one  variable  xy,  can  be 
compared  by  simply  looking  at  the  test  result  for  =0:  if  the  null  hypothesis  is 
rejected,  the  variable  is  kept  in  the  model;  otherwise,  the  other  model  is  chosen.  If 
there  are  more  than  two  models,  then  it  is  better  to  consider  a  systematic  approach  to 
comparing  them.  For  example,  suppose  we  have  10  potentially  relevant  variables  and 
we  are  not  sure  which  of  them  to  include  in  the  final  model.  There  are  2 10  =  1024 
possible  different  combinations  of  variables  and  in  turn  so  many  choices  of  models ! 
The  inclusion  or  deletion  of  variables  can  be  done  systematically  with  various  proce¬ 
dures,  for  example  with  backward  selection  (also  known  as  backward  elimination) 
as  follows: 

1.  Start  with  the  full  model  which  contains  all  variables  of  interest,  T  =  {xi,  X2, 

. . . ,  Xp}. 

2.  Remove  the  variable  xj  e  T  which  optimizes  a  criterion,  i.e.  leads  to  the  smallest 
AIC  (the  highest  R2dj ,  the  highest  test  statistic,  or  the  smallest  significance) 
among  all  possible  choices. 

3.  Repeat  step  2.  until  a  stop  criterion  is  fulfilled,  i.e.  until  no  improvement  regarding 
AIC,  R2,  or  the  p -value  can  be  achieved. 
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There  are  several  other  approaches.  Instead  of  moving  in  the  backward  direction, 
we  can  also  move  in  the  forward  direction.  Forward  selection  starts  with  a  model 
with  no  covariates  and  adds  variables  as  long  as  the  model  continues  to  improve  with 
respect  to  a  particular  criterion.  The  stepwise  selection  approach  combines  forward 
selection  and  backward  selection:  either  one  does  backward  selection  and  checks  in 
between  whether  adding  variables  yields  improvement,  or  one  does  forward  selection 
and  continuously  checks  whether  removing  the  already  added  variables  improves  the 
model. 


Example  11.8.3  Consider  the  pizza  data:  if  delivery  time  is  the  outcome,  and  branch, 
bill,  operator,  driver,  temperature,  and  number  of  ordered  pizzas  are  potential  covari¬ 
ates,  we  may  decide  to  include  only  the  relevant  variables  in  a  final  model.  Using 
the  stepAIC  function  of  the  library  (MASS)  allows  us  implementing  backward 
selection  with  R. 

library (MASS) 

ms  <-  lm(time~branch+bill+operator+driver 

+temperature+pizzas) 

stepAIC (ms,  direction= ' back ' ) 


At  the  first  step,  the  R  output  states  that  the  AIC  of  the  full  model  is  4277.56. 
Then,  the  AIC  values  are  displayed  when  variables  are  removed:  4275.9  if  operator 
is  removed,  4279.2  if  driver  is  removed,  and  so  on.  One  can  see  that  the  AIC  is 
minimized  if  operator  is  excluded. 

Start:  AIC=4277.56 

time  ~  branch+bill+operator+driver+temperature+pizzas 


-  operator 
<none> 

-  driver 

-  branch 

-  pizzas 

-  temperature 

-  bill 


Df  Sum  of  Sq  RSS  AIC 

1  9.16  36508  4275.9 

36499  4277.6 
4  279.71  36779  4279.2 

2  532.42  37032  4291.9 

1  701.57  37201  4299.7 

1  1931.50  38431  4340.8 

1  2244.28  38743  4351.1 


Now  R  fits  the  model  without  operator.  The  AIC  is  4275.88.  Excluding  further 
variables  does  not  improve  the  model  with  respect  to  the  AIC. 
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Step:  AIC=4275 . 88 

time  ~  branch  +  bill  +  driver  +  temperature  +  pizzas 


Df 

Sum  of  Sq 

RSS 

AIC 

<none> 

36508 

4275.9 

-  driver 

4 

288.45 

36797 

4277.8 

-  branch 

2 

534.67 

37043 

4290.3 

-  pizzas 

1 

705.53 

37214 

4298.1 

-  temperature 

1 

1923.92 

38432 

4338.9 

-  bill 

1 

2249.60 

38758 

4349 . 6 

We  therefore  conclude  that  the  final  “best”  model  includes  all  variables  considered, 
except  operator.  Using  stepwise  selection  (with  option  both  instead  of  back)  yields 
the  same  results.  We  could  now  interpret  the  summary  of  the  chosen  model. 


Limitations  and  further  practical  considerations.  The  above  discussions  of  vari¬ 
able  selection  approaches  makes  it  clear  that  there  are  a  variety  of  options  to  compare 
fitted  models,  but  there  is  no  unique  best  way  to  do  so.  Different  criteria  are  based 
on  different  philosophies  and  may  lead  to  different  outcomes.  There  are  a  few  con¬ 
siderations  that  should  however  always  be  taken  into  account: 

•  If  categorical  variables  are  included  in  the  model,  then  all  respective  dummy  vari¬ 
ables  need  to  be  considered  as  a  whole:  all  dummy  variables  of  this  variable  should 
either  be  removed  or  kept  in  the  model.  For  example,  a  variable  with  3  categories 
is  represented  by  two  dummy  variables  in  a  regression  model.  In  the  process  of 
model  selection,  both  of  these  variables  should  either  be  kept  in  the  model  or 
removed,  but  not  just  one  of  them. 

•  A  similar  consideration  refers  to  interactions.  If  an  interaction  is  added  to  the 
model,  then  the  main  effects  should  be  kept  in  the  model  as  well.  This  enables 
straightforward  interpretations  as  outlined  in  Sect.  11.7.3. 

•  The  same  applies  to  polynomials.  If  adding  a  cubic  transformation  of  x\ ,  the  squared 
transformation  as  well  as  the  linear  effect  should  be  kept  in  the  model. 

•  When  applying  backward,  forward,  or  stepwise  regression  techniques,  the  results 
may  vary  considerably  depending  on  the  method  chosen!  This  may  give  different 
models  for  the  same  data  set.  Depending  on  the  strategy  used  to  add  and  remove 
variables,  the  choice  of  the  criterion  (e.g.  AIC  versus  p -values),  and  the  detailed 
set-up  (e.g.  if  the  strategy  is  to  add/remove  a  variable  if  the  p-v alue  is  smaller  than 
a ,  then  the  choice  of  a  is  important),  one  may  obtain  different  results. 

•  All  the  above  considerations  show  that  model  selection  is  a  non-trivial  task  which 
should  always  be  complemented  by  substantial  knowledge  of  the  subject  matter. 
If  there  are  not  too  many  variables  to  choose  from,  simply  reporting  the  full  model 
can  be  meaningful  too. 
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1 1 .9  Checking  Model  Assumptions 

The  assumptions  made  for  the  linear  model  mostly  relate  to  the  error  terms:  e  ~ 
N(0,  a 2 1).  This  implies  a  couple  of  things:  i)  the  difference  between  y  and  X/? 
(which  is  e)  needs  to  follow  a  normal  distribution,  ii)  the  variance  for  each  error  et 
is  constant  as  a2  and  therefore  does  not  depend  on  /,  and  iii)  there  is  no  correlation 
between  the  error  terms,  Cov(^/,  ey)  =  0  for  all  i,  i' .  We  also  assume  that  X/?  is 
adequate  in  the  sense  that  we  included  all  relevant  variables  and  interactions  in  the 

/V 

model  and  that  the  estimator  is  stable  and  adequate  due  to  influential  observations 
or  highly  correlated  variables.  Now,  we  demonstrate  how  to  check  assumptions  i) 
and  ii). 

Normality  assumption.  The  errors  e  are  assumed  to  follow  a  normal  distribution. 
We  never  know  the  errors  themselves.  The  only  possibility  is  to  estimate  the  errors 

/V 

by  means  of  the  residuals  e  =  y  —  Xfi.  To  ensure  that  the  estimated  residuals  are 
comparable  to  the  scale  of  a  standard  normal  distribution,  they  can  be  standardized 
via 

_  G 

Ci  ~ 

where  Pa  is  the  i th  diagonal  element  of  the  hat  matrix  P  =  X(X'X)_1X'.  Generally, 
the  estimated  standardized  residuals  are  used  to  check  the  normality  assumption.  To 
obtain  a  first  impression  about  the  error  distribution,  one  can  simply  plot  a  histogram 
of  the  standardized  residuals.  The  normality  assumption  is  checked  with  a  QQ-plot 
where  the  theoretical  quantiles  of  a  standard  normal  distribution  are  plotted  against 
the  quantiles  from  the  standardized  residuals.  If  the  data  approximately  matches  the 
bisecting  line,  then  we  have  evidence  for  fulfilment  of  the  normality  assumption 
(Fig.  11.8a),  otherwise  we  do  not  (Fig.  11.8b). 


Theoretical  Quantiles 


(a)  Normality  without  problems 


(b)  Normality  with  some  problems 


Fig.  1 1 .8  Checking  the  normality  assumption  with  a  QQ-plot 
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Small  deviations  from  the  normality  assumption  are  not  a  major  problem  as  the 
least  squares  estimator  of  p  remains  unbiased.  However,  confidence  intervals  and 
tests  of  hypothesis  rely  on  this  assumption,  and  particularly  for  small  sample  sizes 
and/or  stronger  violations  of  the  normality  assumptions,  these  intervals  and  conclu¬ 
sions  from  tests  of  hypothesis  no  longer  remain  valid.  In  some  cases,  these  problems 
can  be  fixed  by  transforming  y. 

Heteroscedasticity.  If  errors  have  a  constant  variance,  we  say  that  the  errors  are 
homoscedastic  and  we  call  this  phenomenon  homoscedasticity.  When  the  variance 
of  errors  is  not  constant,  we  say  that  the  errors  are  heteroscedastic.  This  phenom¬ 
enon  is  also  known  as  heteroscedasticity.  If  the  variance  a2  depends  on  /,  then 
the  variability  of  the  e*  will  be  different  for  different  groups  of  observations.  For 
example,  the  daily  expenditure  on  food  (y)  may  vary  more  among  persons  with  a 
high  income  (xi),  so  fitting  a  linear  model  yields  stronger  variability  of  the  e*  among 
higher  income  groups.  Plotting  the  fitted  values  y/  (or  alternatively  jc*)  against  the 
standardized  residuals  (or  a  transformation  thereof)  can  help  to  detect  whether  or  not 
problems  exist.  If  there  is  no  pattern  in  the  plot  (random  plot),  then  there  is  likely  no 
violation  of  the  assumption  (see  Fig.  1 1.10a).  However,  if  there  is  a  systematic  trend, 
i.e.  higher/lower  variability  for  higher/lower  y* ,  then  this  indicates  heteroscedasticity 
(Fig.  1 1 . 10b,  trumpet  plot).  The  consequences  are  again  that  confidence  intervals  and 
tests  may  no  longer  be  correct;  again,  in  some  situations,  a  transformation  of  y  can 
help. 

Example  11.9.1  Recall  Example  11.8.3  where  we  identified  a  good  model  to  describe 
the  delivery  time  for  the  pizza  data.  Branch,  bill,  temperature,  number  of  pizzas 
ordered,  and  driver  were  found  to  be  associated  with  delivery  time.  To  explore 
whether  the  normality  assumption  is  fulfilled  for  this  model,  we  can  create  a  his¬ 
togram  for  the  standardized  residuals  (using  the  R  function  rstandardO).  A  QQ- 
plot  is  contained  in  the  various  model  diagnostic  plots  of  the  plot  ()  function. 

fm  <-  lm(time~branch+bill+driver+temperature 
+pizzas) 

hist (rstandard (f m) ) 
plot(fm,  which=2) 

Figure  11.9  shows  a  reasonably  symmetric  distribution  of  the  residuals,  maybe 
with  a  slightly  longer  tail  to  the  right.  As  expected  for  a  standard  normal  distribution, 
not  many  observations  are  larger  than  2  or  smaller  than  —2  (which  are  close  to  the  2.5 
and  97.5  %  quantiles).  The  QQ-plot  also  points  towards  a  normal  error  distribution 
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Standardized  residuals 

(a)  Via  the  histogram  of  the  standardized 
residuals 


Theoretical  Quantiles 

lm(time  ~  branch  +  bill  +  driver  +  temperature  +  pizzas) 


(b)  Via  a  QQ-plot  of  the  standardized 
residuals 


Fig.  11.9  Checking  the  normality  assumption  in  Example  11.9.1 


because  the  observed  quantiles  lie  approximately  on  the  bisecting  line.  It  seems  that 
the  lowest  residuals  deviate  a  bit  from  the  expected  normal  distribution,  but  not 
extremely.  The  normality  assumption  does  not  seem  to  be  violated. 

Plotting  the  fitted  values  y*  against  the  square  root  of  the  absolute  values  of  the 
standardized  residuals  (=  y/\ef\,  used  by  R  for  stability  reasons)  yields  a  plot  with 
no  visible  structure  (see  Fig.  11.10a).  There  is  no  indication  of  heteroscedasticity. 


(a)  Example  11.9.1:  chaos  plot 


(b)  Violation:  a2  depends  on  i 


Fig.  11.10  Checking  the  heteroscedasticity  assumption 
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plot(fm,  which=3) 


Figure  11.10b  shows  an  artificial  example  where  heteroscedasticity  occurs:  the 
higher  the  fitted  values,  the  higher  is  the  variation  of  residuals.  This  is  also  called  a 

trumpet  plot. 


11.10  Association  Versus  Causation 

There  is  a  difference  between  association  and  causation:  association  says  that  higher 
values  of  X  relate  to  higher  values  of  Y  (or  vice  versa),  whereas  causation  says  that 
because  values  of  X  are  higher,  values  of  Y  are  higher  too.  Regression  analysis,  as 
introduced  in  this  chapter,  establishes  associations  between  Xf  ’s  and  Y,  not  causation, 
unless  strong  assumptions  are  made. 

For  example,  recall  the  pizza  delivery  example  from  Appendix  A.4.  We  have  seen 
in  several  examples  and  exercises  that  the  delivery  time  varies  by  driver.  In  fact,  in  a 
multiple  linear  regression  model  where  Y  is  the  delivery  time,  and  the  covariates  are 
branch,  bill,  operator,  driver,  and  temperature,  we  get  significant  differences  of  the 
mean  delivery  times  of  some  drivers  (e.g.  Domenico  and  Bruno).  Does  this  mean  that 
because  Domenico  is  the  driver,  the  pizzas  are  delivered  faster?  Not  necessarily.  If 
Domenico  drives  the  fastest  scooter,  then  this  may  be  the  cause  of  his  speedy  deliv¬ 
eries.  However,  we  have  not  measured  the  variable  “type  of  scooter”  and  thus  cannot 
take  it  into  account.  The  result  we  obtain  from  the  regression  model  is  still  useful 
because  it  provides  the  manager  with  useful  hypotheses  and  predictions  about  his 
delivery  times,  but  the  interpretation  that  the  driver’s  abilities  cause  shorter  delivery 
times  may  not  necessarily  be  appropriate. 

This  example  highlights  that  one  of  the  most  important  assumptions  to  interpret 
regression  parameters  in  a  causal  way  is  to  have  measured  (and  used)  all  variables 
Xi  which  affect  both  the  outcome  Y  and  the  variable  of  interest  A  (e.g.  the  variable 
“driver”  above).  Another  assumption  is  that  the  relationship  between  all  Xi ’s  and 
Y  is  modelled  correctly,  for  example  non-linear  if  appropriate.  Moreover,  we  need 
some  a  priori  assumptions  about  how  the  variables  relate  to  each  other,  and  some 
technical  assumptions  need  to  be  met  as  well.  The  interested  reader  is  referred  to 
Hernan  and  Robins  (2017)  for  more  details. 
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11.11  Key  Points  and  Further  Issues 

Note: 

/  If  X  is  a  continuous  variable,  the  parameter  /3  in  the  linear  regression 
model  Y  =  a  +  fiX  +  e  can  be  interpreted  as  follows: 

“For  each  unit  of  increase  in  X ,  there  is  an  increase  of  /3  units  in  E(F).” 

/  If  X  can  take  a  value  of  zero,  then  it  can  be  said  that 

“For  X  =  0,  E(F)  equals  aX 

If  X  =  0  is  out  of  the  data  range  or  implausible,  it  does  not  make  sense 
to  interpret  a. 

/  A  model  is  said  to  be  linear  when  it  is  linear  in  its  parameters. 

/  It  is  possible  to  include  many  different  covariates  in  the  regression 
model:  they  can  be  binary,  log-transformed,  or  take  any  other  form, 
and  it  is  still  a  linear  model.  The  interpretation  of  these  variables  is 
always  conditional  on  the  other  variables;  i.e.  the  reported  association 
assumes  that  all  other  variables  in  the  model  are  held  fixed. 

/  To  evaluate  whether  xj  is  associated  with  y,  it  is  useful  to  test  whether 
/3j  =0.  If  the  null  hypothesis  Hq  :  /3j  =  0  is  rejected  (i.e.  the  p -value  is 
smaller  than  a ),  one  says  that  there  is  a  significant  association  of  xj  and 
y.  However,  note  that  a  non-significant  result  does  not  necessarily  mean 
there  is  no  association,  it  just  means  we  could  not  show  an  association. 

/  An  important  assumption  in  the  multivariate  linear  model  y  =  X/?  +  e 
relates  to  the  errors: 

e  ~  A(0,  cr2 1). 

To  check  whether  the  errors  are  (approximately)  normally  distributed, 
one  can  plot  the  standardized  residuals  in  a  histogram  or  a  QQ-plot.  The 
heteroscedasticity  assumption  (i.e.  a2  does  not  depend  on  i)  is  tested 
by  plotting  the  fitted  values  (y/)  against  the  standardized  residuals  (£/). 
The  plot  should  show  no  pattern  (random  plot). 

/  Different  models  can  be  compared  by  systematic  hypothesis  testing, 
R2dj ,  AIC,  and  other  methods.  Different  methods  may  yield  different 
results,  and  there  is  no  unique  best  option. 
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11.12  Exercises 

Exercise  11.1  The  body  mass  index  (BMI)  and  the  systolic  blood  pressure  of  6 
people  were  measured  to  study  a  cardiovascular  disease.  The  data  are  as  follows: 


Body  mass  index 

26 

23 

27 

28 

24 

25 

Systolic  blood  pressure 

170 

150 

160 

175 

155 

150 

(a)  The  research  hypothesis  is  that  a  high  BMI  relates  to  a  high  blood  pressure. 
Estimate  the  linear  model  where  blood  pressure  is  the  outcome  and  BMI  is  the 
co variate.  Interpret  the  coefficients. 

(b)  Calculate  R 2  to  judge  the  goodness  of  fit  of  the  model. 

Exercise  11.2  A  psychologist  speculates  that  spending  a  lot  of  time  on  the  internet 
has  a  negative  effect  on  children’s  sleep.  Consider  the  following  data  on  hours  of  deep 
sleep  (T)  and  hours  spent  on  the  internet  (X)  where  Xi  and  y;  are  the  observations 
on  internet  time  and  deep  sleep  time  of  the  i  th  (i  =  1,2,  ...  ,9)  child  respectively: 


Child  i 

1 

2 

3 

4 

5 

6 

7 

8 

9 

Internet  time  Xj  (in  h) 

0.3 

2.2 

0.5 

0.7 

1.0 

1.8 

3.0 

0.2 

2.3 

Sleep  time  37  (in  h) 

Ui 

bo 

4.4 

6.5 

5.8 

5.6 

5.0 

4.8 

6.0 

6.1 

(a)  Estimate  the  linear  regression  model  for  the  given  data  and  interpret  the  coeffi¬ 
cients. 

(b)  Calculate  R 2  to  judge  the  goodness  of  fit  of  the  model. 

(c)  Reproduce  the  results  of  a)  and  b)  in  R  and  plot  the  regression  line. 

(d)  Now  assume  that  we  only  distinguish  between  spending  more  than  1  hour  on  the 
internet  (X  =  1)  and  spending  less  than  (or  equal  to)  one  hour  on  the  internet 
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(X  =  0).  Estimate  the  linear  model  again  and  compare  the  results.  How  can  f3 
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now  be  interpreted?  Describe  how  f3  changes  if  those  who  spend  more  than  one 
hour  on  the  internet  are  coded  as  0  and  the  others  as  1 . 

Exercise  11.3  Consider  the  following  data  on  weight  and  height  of  17  female  stu¬ 
dents: 


Student  i 
Weight  >v 
Height  Xi 

1 

68 

174 

2 

58 

164 

3 

53 

164 

4 

60 

165 

5 

59 

170 

6 

60 

168 

7 

55 

167 

8 

62 

166 

9 

58 

160 

Student  i 

10 

11 

12 

13 

14 

15 

16 

17 

Weight  y 

53 

53 

50 

64 

77 

60 

63 

69 

Height  x 

160 

163 

157 

168 

179 

170 

168 

170 

(a)  Calculate  the  correlation  coefficient  of  Bravais-Pearson  (use  X/=i  xOi  — 
170,  821,  x  =  166.65,  y=  60.12,  £"=1  yf  =  62,  184,  £"=i  xf  =  472,569). 
What  does  this  imply  with  respect  to  a  linear  regression  of  height  on  weight? 
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(b)  Now  estimate  and  interpret  the  linear  regression  model  where  “weight”  is  the 
outcome. 

(c)  Predict  the  weight  of  a  student  with  a  height  175  cm. 

(d)  Now  produce  a  scatter  plot  of  the  data  (manually  or  by  using  R)  and  interpret  it. 

(e)  Add  the  following  two  points  to  the  scatter  plot  (*18,  yis)  =  (175,  55)  and 
(*19,  yi9)  =  (150,  75).  Speculate  how  the  linear  regression  estimate  will  change 
after  adding  these  points. 

(f)  Re-estimate  the  model  using  all  19  observations  and  =  191,  696  and 

£>?  =  525,694. 

(g)  Given  the  results  of  the  two  regression  models :  What  are  the  general  implications 
with  respect  to  the  least  squares  estimator  of  /3? 

Exercise  11.4  To  study  the  association  of  the  monthly  average  temperature  (in  °C, 

X )  and  hotel  occupation  (in  %,  T),  we  consider  data  from  three  cities:  Polenca 

(Mallorca,  Spain)  as  a  summer  holiday  destination,  Davos  (Switzerland)  as  a  winter 

skiing  destination,  and  Basel  (Switzerland)  as  a  business  destination. 


Month 

Davos 

Polenca 

Basel 

Y 

X 

Y 

X 

Y 

Jan 

-6 

91 

10 

13 

1 

23 

Feb 

-5 

89 

10 

21 

0 

82 

Mar 

2 

76 

14 

42 

5 

40 

Apr 

4 

52 

17 

64 

9 

45 

May 

7 

42 

22 

79 

14 

39 

Jun 

15 

36 

24 

81 

20 

43 

Jul 

17 

37 

26 

86 

23 

50 

Aug 

19 

39 

27 

92 

24 

95 

Sep 

13 

26 

22 

36 

21 

64 

Oct 

9 

27 

19 

23 

14 

78 

Nov 

4 

68 

14 

13 

9 

9 

Dec 

0 

92 

12 

41 

4 

12 

(a)  Interpret  the  following  regression  model  output  where  the  outcome  is  “hotel 
occupation”  and  “temperature”  is  the  covariate. 

Estimate  Std.  Error  t  value  Pr (> I t | ) 
(Intercept)  50.33459  7.81792  6.438  2.34e-07  *** 

X  0.07717  0.51966  0.149  0.883 

(b)  Interpret  the  following  output  where  “city”  is  treated  as  a  covariate  and  “hotel 
occupation”  is  the  outcome. 

Estimate  Std.  Error  t  value  Pr (> I t | ) 
(Intercept)  48.3333  7.9457  6.083  7.56e-07  *** 

cityDavos  7.9167  11.2369  0.705  0.486 

cityPolenca  0.9167  11.2369  0.082  0.935 
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(c)  Interpret  the  following  output  and  compare  it  with  the  output  from  b): 

Analysis  of  Variance  Table 
Response:  Y 

Df  Sum  Sq  Mean  Sq  F  value  Pr(>F) 
city  2  450.1  225.03  0.297  0.745 

Residuals  33  25001.2  757.61 


(d)  In  the  following  multiple  linear  regression  model,  both  “city”  and  “tempera¬ 
ture”  are  treated  as  covariates.  How  can  the  coefficients  be  interpreted? 


Estimate  Std.  Error  t  value  Pr (> I t | ) 


(Intercept)  44.1731 
X  0.3467 
cityDavos  9.7946 
cityPolenca  -1.1924 


10.9949  4.018  0.000333  *** 

0.6258  0.554  0.583453 

11.8520  0.826  0.414692 

11.9780  -0.100  0.921326 


(e)  Now  consider  the  regression  model  for  hotel  occupation  and  temperature  fitted 
separately  for  each  city:  How  can  the  results  be  interpreted  and  what  are  the 
implications  with  respect  to  the  models  estimated  in  (a)-(d)?  How  can  the  models 
be  improved? 


Davos : 

Estimate 

Std.  Error 

t  value 

PrOltl) 

(Intercept) 

73.9397 

4.9462 

14.949 

3 . 61e-08 

*** 

X 

-2.6870 

0.4806 

-5.591 

0.000231 

*** 

Polenca: 

Estimate 

Std.  Error 

t  value 

Pr (> | t | ) 

(Intercept) 

-22.6469 

16.7849 

-1.349 

0.20701 

X 

3.9759 

0.8831 

4.502 

0.00114 

** 

Basel : 

Estimate 

Std.  Error 

t  value 

Pr (> | t | ) 

(Intercept) 

32.574 

13.245 

2.459 

0.0337 

* 

X 

1.313 

0.910 

1.443 

0.1796 

(f)  Describe  what  the  design  matrix  will  look  like  if  city,  temperature,  and  the 
interaction  between  them  are  included  in  a  regression  model. 

(g)  If  the  model  described  in  (f)  is  fitted  the  output  is  as  follows: 


Estimate  Std.  Error  t  value  Pr ( > I t | ) 


(Intercept) 

X 

cityDavos 
cityPolenca 
X : cityDavos 
X : cityPolenca 


32.5741 

1.3133 

41.3656 

-55.2210 

-4.0003 

2.6626 


10.0657  3.236  0.002950  ** 

0.6916  1.899  0.067230  . 

12.4993  3.309  0.002439  ** 

21.0616  -2.622  0.013603  * 

0.9984  -4.007  0.000375  *** 

1.1941  2.230  0.033388  * 


Interpret  the  results. 
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(h)  Summarize  the  association  of  temperature  and  hotel  occupation  by  city — 
including  95  %  confidence  intervals — using  the  interaction  model.  The  covari- 


ance  matrix  is  as  follows: 

(Int . ) 

X 

Davos 

Polenca 

X : Davos 

X : Polen 

(Int . ) 

101.31 

-5.73 

-101.31 

-101.31 

5.73 

5.73 

X 

-5.73 

0.47 

5.73 

5.73 

-0.47 

-0.47 

Davos 

-101.31 

5.73 

156.23 

101.31 

-9.15 

-5.73 

Polenca 

-101.31 

5.73 

101.31 

443.59 

-5.73 

-22.87 

X : Davos 

5.73 

-0.47 

-9.15 

-5.73 

0.99 

0.47 

X : Polenca 

5.73 

-0.47 

-5.73 

-22.87 

0.47 

1.42 

Exercise  11.5  The  theatre  data  (see  Appendix  A.4)  describes  the  monthly  expen¬ 
diture  on  theatre  visits  of  699  residents  of  a  Swiss  city.  It  captures  not  only  the 
expenditure  on  theatre  visits  (in  SFR)  but  also  age,  gender,  yearly  income  (in  1000 
SFR),  and  expenditure  on  cultural  activities  in  general  as  well  as  expenditure  on 
theatre  visits  in  the  preceding  year. 


(a)  The  summary  of  the  multiple  linear  model  where  expenditure  on  theatre  visits 
is  the  outcome  is  as  follows: 


Estimate 

Std.  Error 

t  value 

PrOltl) 

(Intercept) 

-127.22271 

19.15459 

-6.642 

6 . 26e-ll 

*** 

Age 

0.39757 

0.19689 

[1] 

[2] 

Sex 

22.22059 

5.22693 

4.251 

2.42e-05 

*** 

Income 

1.34817 

0.20947 

6.436 

2 . 29e-10 

*** 

Culture 

0.53664 

0.05053 

10.620 

<2e-16 

*** 

Theatre_ly 

0.17191 

0.11711 

1.468 

0.1426 

How  can  the  missing  values  [1]  and  [2]  be  calculated? 

(b)  Interpret  the  model  diagnostics  in  Fig.  11.11. 

(c)  Given  the  diagnostics  in  (b),  how  can  the  model  be  improved?  Plot  a  histogram 
of  theatre  expenditure  in  R  if  you  need  further  insight. 

(d)  Consider  the  model  where  theatre  expenditure  is  log- transformed: 

Coefficients : 


Estimate 

Std.  Error 

t  value 

Pr (> | t | ) 

(Intercept) 

2.9541546 

0.1266802 

23.320 

<  2e-16 

*** 

Age 

0.0038690 

0.0013022 

2.971 

0.00307 

** 

Sex 

0.1794468 

0.0345687 

5.191 

2 . 75e-07 

*** 

Income 

0.0087906 

0.0013853 

6.346 

4 . 00e-10 

*** 

Culture 

0.0035360 

0.0003342 

10.581 

<  2e-16 

*** 

Theatre_ly 

0.0013492 

0.0007745 

1.742 

0.08197 

• 

How  can  the  coefficients  be  interpreted? 
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11  Linear  Regression 


Theoretical  Quantiles 

lm(Theatre  ~  Age  +  Sex  +  Income  +  Culture  +  Theatrejy) 

(a)  QQ-plot 

Fig.  11.11  Checking  the  model  assumptions 


Theoretical  Quantiles 

lm(l(log(Theatre))  ~  Age  +  Sex  +  Income  +  Culture  +  Theatrejy) 

(a)  QQ-plot 

Fig.  11.12  Checking  the  model  assumptions 


(b)  Fitted  values  versus  residuals 


(b)  Fitted  values  vs.  residuals 


(e)  Judge  the  quality  of  the  model  from  d)  by  means  of  Figs.  11.12a  and  11.12b. 
What  do  they  look  like  compared  with  those  from  b)? 


Exercise  11.6  Consider  the  pizza  delivery  data  described  in  Appendix  A.4. 

(a)  Read  the  data  into  R.  Fit  a  multiple  linear  regression  model  with  delivery  time 
as  the  outcome  and  temperature,  branch,  day,  operator,  driver,  bill,  number  of 
ordered  pizzas,  and  discount  customer  as  covariates.  Give  a  summary  of  the 
coefficients. 
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(b)  Use  R  to  calculate  the  95  %  confidence  intervals  of  all  coefficients.  Hint:  the 
standard  errors  of  the  coefficients  can  be  accessed  either  via  the  covariance 
matrix  or  the  model  summary. 

(c)  Reproduce  the  least  squares  estimate  of  a2.  Hint:  use  residuals  ()  to  obtain 
the  residuals. 

(d)  Now  use  R  to  estimate  both  R2  and  R2d- .  Compare  the  results  with  the  model 
output  from  a). 

(e)  Use  backward  selection  by  means  of  the  st  epAIC  function  from  the  library  MASS 
to  find  the  best  model  according  to  AIC. 

(f)  Obtain  R2dj  from  the  model  identified  in  e)  and  compare  it  to  the  full  model 
from  a). 

(g)  Identify  whether  the  model  assumptions  are  satisfied  or  not. 

(h)  Are  all  variables  from  the  model  in  (e)  causing  the  delivery  time  to  be  either 
delayed  or  improved? 

(i)  Test  whether  it  is  useful  to  add  a  quadratic  polynomial  of  temperature  to  the 
model. 

(j)  Use  the  model  identified  in  (e)  to  predict  the  delivery  time  of  the  last  captured 
delivery  (i.e.  number  1266).  Use  the  predictO  command  to  ease  the  calculation 
of  the  prediction. 


Solutions  of  all  exercises  in  this  chapter  can  be  found  on  p.  409 


*  Source  Toutenburg,  H.,  Heumann,  C.,  Deskriptive  Statistik ,  7th  edition,  2009, 
Springer,  Heidelberg 
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Background 

The  open-source  software  R  was  developed  as  a  free  implementation  of  the  language 
S  which  was  designed  as  a  language  for  statistical  computation,  statistical  program¬ 
ming,  and  graphics.  The  main  intention  was  to  allow  users  to  explore  data  in  an 
easy  and  interactive  way,  supported  by  meaningful  graphical  representations.  The 
statistical  software  R  was  originally  created  by  Ross  Ihaka  and  Robert  Gentleman 
(University  of  Auckland,  New  Zealand). 


Installation  and  Basic  Functionalities 

•  The  “base”  R  version,  i.e.  the  software  with  its  most  relevant  commands,  can 
be  downloaded  from  https://www.r-project.org/.  After  installing  R,  it  is  recom¬ 
mended  to  install  an  editor  too.  An  editor  allows  the  user  to  conveniently  save  and 
display  R-code,  submit  this  code  to  the  R  console  (i.e.  the  R  software  itself),  and 
control  the  settings  and  the  output.  A  popular  choice  of  editor  is  R Studio  (free  of 
charge)  which  can  be  downloaded  from  https://www.rstudio.com/  (see  Fig.  A.l 
for  a  screenshot  of  the  software).  There  are  alternative  good  editors,  for  example 
4  ‘Tinn-R’  ’  (http : // s ourcef orge .  net/proj  ects/tinn-  r/) . 

•  A  lot  of  additional  user-written  packages  are  available  online  and  can  be 
installed  within  the  R  console  or  using  the  R  menu.  Within  the  console,  the 
install  .packages ("package  to  install")  function  can  be  used.  Please 
note  that  an  internet  connection  is  required. 


©  Springer  International  Publishing  Switzerland  2016 
C.  Heumann  et  al.,  Introduction  to  Statistics  and  Data  Analysis , 
DOI  10.1007/978-3-319-46162-5 
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Statistics  has  a  close  relationship  to  algebra:  data  sets  can  be  seen  as  matrices,  and 
variables  as  vectors.  R  makes  use  of  these  structures  and  this  is  why  we  first  intro¬ 
duce  data  structure  functionalities  before  explaining  some  of  the  most  relevant  basic 
statistical  commands. 

R  as  a  Calculator,  Basic  Data  Structures  and  Arithmetic 
Operations 

•  The  character  #  marks  the  beginning  of  a  comment.  All  characters  until  the  end  of 
the  line  are  ignored  by  R.  We  use  #  to  comment  on  our  R -code. 

•  If  we  know  the  name  of  a  command  we  would  like  to  use,  and  we  want  to  learn 
about  its  functionality,  typing  ? command  in  the  R  command  line  prompt  displays 
a  help  page,  e.g. 

?mean 

displays  a  help  page  for  the  arithmetic  mean  function. 

•  Using 

example (mean) 

shows  application  examples  of  the  respective  function. 

•  The  command  c(l,2,3,4,5)  combines  the  numbers  1,  2,  3,  4  and  5  into  a  vector 
(e.g.  a  variable). 

•  Vectors  can  be  assigned  to  an  “object”.  For  example, 

X  <-  c(2, 12,22,32) 

assigns  a  numeric  vector  of  length  4  to  the  object  X.  In  general,  the  arrow  sign 
(<— )  is  a  very  important  concept  to  store  data,  summaries,  and  outputs  in  objects 
(i.e.  the  name  in  the  front  of  the  <—  sign).  Note  that  R  is  case  sensitive;  i.e.  X  and 
x  are  two  different  variable  names.  Instead  of  one  can  also  use  “=”. 

•  Sequences  of  numbers  can  be  created  using  the  seq  and  rep  commands.  For 
example, 

seq(l , 10) 


and 


rep(l , 10) 


yield 


and 


[1]  1  2  3  4  5  6  7  8  9  10 


[1]  1111111111 


respectively. 
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•  Basic  data  structures  are  vectors,  matrices,  arrays,  lists,  and  data  frames.  They 
can  contain  numeric  values,  logical  values  or  even  characters  (strings).  In  the  latter 
case,  arithmetic  operations  are  not  allowed. 

-  A  numeric  vector  of  length  5  can  be  constructed  by  the  command 

x  <-  vector (mode= "numeric" ,  length=5) 

The  elements  can  be  accessed  by  squared  brackets:  [  ].  For  example, 
x [3]  <-  4 

assigns  the  value  4  to  the  third  element  of  the  object  x.  Logical  vectors  containing 
the  values  TRUE  and  FALSE  are  also  possible.  In  arithmetic  operations,  TRUE 
corresponds  to  1  and  FALSE  corresponds  to  0  (which  is  the  default).  Consider 
the  following  example: 

x.log  <-  vector (mode="logical" ,  length=4) 
x.log[l]  =  x.log  [3]  =  x.log  [4]  =  TRUE 
mean(x . log) 

returns  as  output  0 . 75  because  the  mean  of  (1 ,  0,  1 ,  1)  =(TRUE,  FALSE,  TRUE, 
TRUE)  is  0.75. 

-  A  matrix  can  be  constructed  by  the  matrix  0  command: 

x  <-  matrix (nrow=4,  ncol=2,  data=l:8,  byrow=T) 

creates  a  4  x  2  matrix,  where  the  data  values  are  the  natural  numbers  1,2,3,..., 
8  which  are  stored  row- wise  in  the  matrix, 

[,1]  [  >  2] 

[1,]  1  2 

[2,]  3  4 

[3,]  5  6 

[4,]  7  8 

because  of  the  parameter  byrow=T  (which  is  equivalent  to  byrow=TRUE).  The 
default  is  byrow=F  which  would  store  the  data  column-wise. 

-  Arrays  are  more  general  data  structures  than  vectors  and  matrices  in  the  sense 
that  they  can  have  more  than  two  dimensions.  For  instance, 

x  <-  array (data=l : 12 ,  dim=c (3 , 2 , 2) ) 

creates  a  three-dimensional  array  with  3*2*2=  12  elements. 

-  A  list  can  contain  objects  of  different  types.  For  example,  a  list  element  can  be 
a  vector  or  matrix.  Lists  can  be  initialized  by  the  command  list  and  can  grow 
dynamically.  It  is  important  to  understand  that  list  elements  should  be  accessed 
by  the  name  of  the  entry  via  the  dollar  sign  or  using  double  brackets: 
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x  <-  list(one=c(l,2,3,4,5) ,two=c("Hello" ,  "world",  "!")) 
x 

$one 

[1]  1  2  3  4  5 
$two 

[1]  "Hello"  "world"  "!" 
x [ [2] ] 

[1]  "Hello"  "world"  "!" 
x$one 

[1]  1  2  3  4  5 

-  A  data  frame  is  the  standard  data  structure  for  storing  a  data  set  with  rows  as 
observations  and  columns  as  variables.  Many  statistical  procedures  in  R  (such 
as  the  lm  function  for  linear  models)  expect  a  data  frame.  A  data  frame  is 
conceptually  not  much  different  from  a  matrix  and  can  either  be  initialized  by 
reading  a  data  set  from  an  external  file  or  by  binding  several  column  vectors. 
As  an  example,  we  consider  three  variables  (age,  favourite  hobby,  and  favourite 
animal)  each  with  five  observations: 

age  <-  c (25, 33, 30, 40, 28) 

hobby  <-  c("Reading" , "Sports" , "Games" , "Reading" , "Games") 
animal  <-  c("Elephant " ,  "Giraffe",  NA,  "Monkey",  "Cat") 
dat  <-  data. frame (age , hobby , animal) 

names (dat)  <-  c("Age" , "Favourite .hobby" , "Favourite . animal") 
dat 

The  resulting  output  is 


>  dat 


Age 

Favourite . hobby 

Favourite . animal 

1 

25 

Reading 

Elephant 

2 

33 

Sports 

Giraffe 

3 

30 

Games 

<NA> 

4 

40 

Reading 

Monkey 

5 

28 

Games 

Cat 

where  NA  denotes  a  missing  value.  With  write  .table  or  a  specialized  version 
thereof  such  as  wr  ite  .  csv  (for  writing  the  data  in  a  file  using  comma-separated 
fields),  a  data  frame  can  be  saved  in  a  file.  The  command  sequence 


write . csv(x=dat ,f ile="toy . csv" ,row.names=FALSE) 
read.dat  <-  read. csv(file="toy. csv") 
read.dat 
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saves  the  data  frame  as  an  external  (comma- separated)  file  and  then  loads  the 
data  again  using  read .  csv. 

Individual  elements  of  the  data  frame  can  be  accessed  using  squared  brackets,  as 
for  matrices.  For  example,  dat  [1,2]  returns  the  first  observation  of  the  second 
variable  column  for  the  data  set  dat.  Individual  columns  (variables)  can  also  be 
selected  using  the  $  sign: 

dat $ Age 

returns  the  age  column: 

[1]  25  33  30  40  28 


•  The  factor  command  is  very  useful  to  store  nominal  variables,  and  the  command 
ordered  is  ideal  for  ordinal  variables.  Both  commands  are  extremely  important 
since  factor  variables  with  more  than  two  categories  are  automatically  expanded 
into  several  columns  of  dummy  variables  if  necessary,  e.g.  if  they  are  included  as 
covariates  in  a  linear  model.  In  the  previous  paragraph,  two  factor  variables  have 
already  been  created.  This  can  be  confirmed  by  typing 

is .factor (dat$Favourite . hobby) 
is . factor (dat$Favourite . animal) 


which  return  the  value  TRUE.  Have  a  look  at  the  following  two  factor  variables: 

sex  <-  factor ("female" , "male" , "male" , "female" , "female") 
grade  <-  ordered(c("low" ,  "medium",  "low",  "high",  "high"), 
levels=c("low" ,  "medium" , "high") ) 


Please  note  that  by  default  alphabetical  order  is  used  to  order  the  categories  (e.g. 
female  is  coded  as  1  and  male  as  2).  However,  the  mapping  of  integers  to  strings 
can  be  controlled  by  the  user  as  seen  for  the  variable  “grade”: 

grade 


returns 


[1]  low  medium  low  high  high 
Levels:  low  <  medium  <  high 
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•  Basic  arithmetic  operations  can  be  applied  directly  to  a  numeric  vector.  Basic 
operations  are  addition  +,  subtraction  — ,  multiplication  *  and  division  /,  inte¬ 
ger  division  %/%,  modulo  operation  %%,  and  exponentiation  with  two  possible 
notations:  **  or  Examples  are  given  as: 


2~3 

# 

command 

[1]  8 

# 

output 

2**3 

# 

command 

[1]  8 

# 

output 

2~0 . 5 

# 

command 

[1]  1.414214 

# 

output 

c (2 , 3 , 5 , 7) ~2 

# 

command:  application  to  a  vector 

[1]  4  9  25  49 

# 

output 

c  (2 , 3 , 5 , 7)  ~c  (2 , 3) 

# 

command:  !!!  ATTENTION! 

[1]  4  27  25  343 

# 

output 

c(l,2,3,4,5,6)~c(2,3,4)  #  command 

[1]  1  8  81  16  125  1296  #output 

c(2,3,5,7)~c(2,3,4)  #  command:  !!!  WARNING  MESSAGE! 

[1]  4  27  625  49 

Warning  message: 
longer  object  length 

is  not  a  multiple  of  shorter  object  length 
in:  c(2,  3,  5,  7)~c(2,  3,  4) 

The  last  four  commands  show  the  “recycling  property”  of  R.  It  tries  to  match  the 
vectors  with  respect  to  the  length  if  possible.  In  fact, 

c  (2 , 3 , 5 , 7)  ~c  (2 , 3) 

is  expanded  to 

c(2,3,5,7)~c(2,3,2,3) 

The  last  example  shows  that  R  gives  a  warning  if  the  length  of  the  shorter  vector 
cannot  be  expanded  to  the  length  of  the  longer  vector  by  a  simple  multiplication 
with  a  natural  number  (2,  3,  4,  . . .).  Here 

c(2,3,5,7)~c(2,3,4) 

is  expanded  to 


c(2,3,5,7)~c(2,3,4,2) 
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such  that  not  all  elements  of 

c(2,3,4) 

are  “recycled”. 

More  on  indexing 

The  standard  ways  of  accessing/indexing  elements  in  a  vector,  matrix,  list,  or  data 
frame  have  already  been  introduced  above,  but  R  allows  a  lot  more  flexible  accessing 
of  elements. 

1.  Selecting  elements  using  vectors  of  positive  numbers  (letters  and  LETTERS 
show  the  26  letters  of  the  alphabet) 

letters  [1 : 3] 
letters  [  c(2,4,6)  ] 

[1]  "a"  "b"  "c" 

[1]  "b"  "d"  "f" 

2.  Selecting  elements  using  logical  vectors 

x  <-  1 : 10  #  numbers  1  to  10 

x[  (x>5)  ]  #  selecting  any  number  >5 

x[  (x°/o°/o2==0)  ]  #  numbers  that  are  divisible  by  2 

x[(x°/o°/o2==l)]  #  numbers  that  are  not  divisible  by  2 

x[5]  <-  NA  #  5th  element  of  x  is  defined 

#  to  be  missing  (NA) 
x 

y  <-  x[!is.na(x)]  #  all  x  which  are  not  missing 

y 

returns  the  output 

[1]  6  7  8  9  10 

[1]  2  4  6  8  10 

[1]  1  3  5  7  9 

[1]  1  2  3  4  NA  6  7  8  9  10 

[1]  1  2  3  4  6  7  8  9  10 
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3.  Selecting  (deleting)  elements  using  negative  numbers 

x  <-  1:10 

x[-(l:5)]  #  x,  but  delete  first  five  entries  of  x 

returns  the  output 

[1]  6  7  8  9  10 

because  the  first  five  elements  have  been  removed. 

4.  Selecting  elements  using  characters 

x  <-  c(Water=l,  Juice=2,  Lemonade=3  ) 
names (x) 
x  ["Juice"] 

returns  the  output 

[1]  "Water"  "Juice"  "Lemonade" 

Juice 

2 


Standard  Functions 


Some  standard  functions  and  their  roles  in  R  are 


abs  () 
sqrt () 

roundO,  floorO,  ceilingO 
sum() ,  prodO 
log(),  loglOO,  log2 () 
exp() 

sin() ,  cos () ,  tan() , 
asinO,  acosO,  atan() 
sinhO,  coshO,  tanh(), 
asinh(x),  acoshQ,  atanh(x) 


Absolute  value 
Square  root 

Rounding,  up  and  down 
Sum  and  product 
Logarithms 
Exponential  function 
Trigonometric  functions 

Hyperbolic  functions 


All  functions  can  again  be  applied  directly  to  numeric  vectors. 
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Statistical  Functions 

Some  statistical  functions  and  their  roles  in  R  are 

me  an  () ,  var  ( )  Mean  and  variance 
c  o v  ( ) ,  c  or  ( )  Covariance  and  correlation 

min  ( ) ,  max  ( )  Minimum  and  maximum 

Note :  the  arguments  of  the  functions  vary  depending  on  the  chosen  method.  For 
example,  the  meanO  function  can  be  applied  to  general  R  objects  where  averaging 
makes  sense  (numeric  or  logical  vectors,  but  also,  e.g.  matrices).  The  functions 
var() ,  cov() ,  cor()  expect  one  or  two  numeric  vectors,  matrices,  or  data  frames. 
Minimum  and  maximum  functions  work  also  with  a  comma-separated  list  of  values, 
i.e. 


min(2,  6.7,  1.2,  8.0) 
provides  the  same  result  (1 . 2)  as 

min(c(2,  6.7,  1.2,  8.0)) 
Examples: 

mean(  c(l,2,5,6)  ) 

[1]  3.5 


var(  c(l,2,5,6)  ) 

[1]  5.666667 

Note  that  var  () ,  cov()  use  the  factor  1  /(n  —  1)  for  the  unbiased  estimate  of  the 
variance  instead  of  1  /n  for  the  empirical  variance,  i.e.  1/3  in  the  example  above. 
Both  functions  can  also  be  applied  to  several  vectors  or  a  matrix.  Then  the  covariance 
matrix  (and  correlation  matrix  in  case  of  cor  () )  is  computed.  For  example,  consider 
two  variables 

age.v  <-  c (25, 30, 35, 40) 

income. v  <-  c(2000,  2500,  2800,  3200) 

Then  both  commands  return  the  symmetric  covariance  matrix  (with  variances  as  the 
diagonal  entries  and  covariances  as  the  non-diagonal  entries). 
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var (cbind(age . v,  income. v)) 

age . v  income . v 
age.v  41.66667  3250.0 
income. v  3250.00000  255833.3 


cov(cbind(age . v,  income. v) ) 

age . v  income . v 
age.v  41.66667  3250.0 
income. v  3250.00000  255833.3 


The  (Pearson)  correlation  between  the  two  variables  is  calculated  as  0.9954293. 

cor (cbind(age . v,  income. v)) 

age . v  income . v 
age.v  1.0000000  0.9954293 
income. v  0.9954293  1.0000000 

The  Spearman  rank  correlation  is  perfectly  1,  since  both  vectors  are  in  increasing 
order: 

cor (cbind(age . v,  income. v),  method=" spearman") 
age . v  income . v 
age .v  1  1 

income. v  1  1 


More  Useful  Functions 

Some  more  commonly  used  standard  functions  and  their  roles  in  R  are  as  follows: 
•  Cumulative  sum  and  product: 

x  <-  c(  1,3,  2,  5) 

cumsum(x)  #  1,  1+3,  1+3+2,  1+3+2+5 

cumprod(x)  #  1,  1*3,  1*3*2,  1*3*2*5 

give  the  output 

[1]  14  6  11 

[1]  1  3  6  30 
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•  Factorial: 

factorial (5) 

returns  5 !  as 
[1]  120 

•  Binomial  coefficient  Q) : 

choose(4,2) 

returns  (^)  as 
[1]  6 


Mathematical  Constants 

The  number  7r  is  a  mathematical  constant,  the  ratio  of  a  circle’s  circumference  to  its 
diameter,  and  is  commonly  approximated  as  3.14159.  It  is  directly  available  in  R  as 
pi. 


Pi 

[1]  3.141593 


Other  “constants”  are 

Inf,  -Inf 

NaN 

NA 

NULL 


oo,  — oo 

Not  a  Number:  e.g.  0/0[l]  NaN 
Not  Available:  missing  values 
empty  set 


Assignment  Operator  for  Creating  Functions 

The  assignment  operator  <—  (“less  than”  sign  followed  by  hyphen)  has  already  been 
introduced  above  in  the  context  of  variables.  Alternatively,  =  (equality  sign)  can  be 
used.  One  can  create  one’s  own  functions:  the  function  is  an  object  with  a  name  which 
takes  values  specified  in  the  round  brackets  and  returns  what  has  been  specified  in  the 
curly  braces.  For  example,  the  following  function  myf  unction  returns  a  polynomial 
of  degree  3  for  a  given  argument  v.  Note  that  by  default  all  four  coefficients  are  equal 
to  1. 
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my. function  <-  function(x,a=l,b=l,c=l,d=l){ 
h  <-  a+b*x+c*x~2+d*x~3 
return (h) 

} 


my . function(2) 
[1]  15 


my . function (x=2 ,  a=4,  b=3) 
[1]  22 


Loops  and  Conditions 

The  concept  of  loops  is  convenient  when  some  operation  has  to  be  repeated.  Loops 
can  be  utilized  in  various  ways,  for  example,  via  for  or  while.  Conditions  are 
specified  with  the  if  statement.  For  example, 

x  <-  1:10 

for(i  in  1 : 10) L  if  (x [i]  >5) {x [i]  <-  x[i]+i} 

} 

x 


returns 

[1]  1  2  3  4  5  12  14  16  18  20 

In  this  example,  r  is  a  vector  with  10  elements:  1,  2,  3,  4,  5,  6,  7,  8,  9,  10.  For 
each  element  x  [i] ,  we  replace  x  [i]  with  x  [i]  +i  if  the  condition  x  [i]  >5  is  true; 
otherwise  we  do  not. 


Statistical  Functions 

Now  we  consider  some  basic  statistical  functions  in  R.  For  illustration,  we  use 
the  painters  data  in  the  following  example.  This  data  is  available  after  loading  the 
library  MASS  (only  a  subset  is  shown  below).  The  data  lists  the  subjective  assessment, 
on  a  0  to  20  integer  scale,  of  54  classical  painters.  The  painters  were  assessed  on 
four  characteristics:  composition,  drawing,  colour,  and  expression.  The  data  is  due 
to  the  eighteenth-century  art  critic,  de  Piles.  Use  ?painters  for  more  information 
on  the  data  set. 


library (MASS) 
painters 
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shows 


Composition  Drawing  Colour  Expression  School 


Da  Udine 

10 

8 

16 

3 

A 

Da  Vinci 

15 

16 

4 

14 

A 

Del  Piombo 

8 

13 

16 

7 

A 

Del  Sarto 

12 

16 

9 

8 

A 

Fr.  Penni 

0 

15 

8 

0 

A 

The  Summary  Function 


The  summary  function  allows  a  quick  overview  of  a  data  frame.  For  numeric  vari¬ 
ables,  the  five-point  summary  (which  is  also  used  in  a  simple  box  plot,  see  Sect.  3.3) 
is  calculated  together  with  the  arithmetic  mean.  For  factor  variables,  the  absolute 
frequencies  of  the  most  frequent  categories  are  printed.  If  the  factor  has  more  than 
six  categories,  the  other  categories  are  summarized  in  a  separate  category — Other. 

summary (painters) 


yields 


Composition  Drawing 


Min. 

:  0.00 

Min. 

:  6.00 

1st  Qu. 

:  8.25 

1st  Qu. 

: 10.00 

Median 

: 12.50 

Median 

: 13.50 

Mean 

: 11 . 56 

Mean 

: 12 . 46 

3rd  Qu. 

: 15 . 00 

3rd  Qu. 

: 15.00 

Max . 

: 18 . 00 

Max . 

: 18.00 

School 


A 

D 

E 

G 

B 

C 

(Other) 


10 

10 

7 

7 
6 
6 

8 


The  summary  function  can  also  be  applied  to  a  single  variable: 

summary (painters$School) 


returns 

ABCDEFGH 
10  6  6  10  7  4  7  4 
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Accessing  Subgroups  in  Data  Frames 

Subgroups,  i.e.  groups  of  observations  which  share  the  same  feature(s)  of  one  or 
several  variables,  can  be  accessed  using  the  subset  command. 


subset (painters ,  School=="F") 

accesses  all  painters  for  which  School== c  CF’  ’  holds. 


Composition  Drawing  Colour  Expression  School 


Durer 

8 

10 

10 

8 

F 

Holbein 

9 

10 

16 

13 

F 

Pourbus 

4 

15 

6 

6 

F 

Van  Leyden 

8 

6 

6 

4 

F 

This  is  a  more  elegant  method  than  selecting  these  observations  by  specifying  a 
condition  in  squared  brackets  via  the  [rows ,  columns]  argument. 

painters [  painters [ ["School"] ]  ==  "F" ,  ] 


Note  that  the  explained  structure  is  an  important  one:  we  access  the  rows  and  columns 
of  a  matrix  or  data  set  by  using  the  [rows ,  columns]  argument.  Here  we  access  all 
rows  for  which  the  variable  “school”  is  “F”.  If,  in  addition,  we  also  want  to  restrict 
the  data  set  to  the  first  two  variables,  we  can  write: 

painters [  painters [ ["School"] ]  ==  "F" ,  c(l,2)] 

Similarly, 

subset (painters ,  Composition  <=  6) 


gives  the  output 

Composition  Drawing  Colour  Expression  School 


Fr.  Penni  0 
Perugino  4 
Bassano  6 
Bellini  4 
Murillo  6 
Palma  Vecchio  5 
Caravaggio  6 
Pourbus  4 


15 

8 

0 

A 

12 

10 

4 

A 

8 

17 

0 

D 

6 

14 

0 

D 

8 

15 

4 

D 

6 

16 

0 

D 

6 

16 

0 

E 

15 

6 

6 

F 

Uninteresting  columns  can  be  eliminated  using  negative  indexing.  For  instance, 
in  the  following  example, 
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subset (painters ,  School=="F",  select=c (-3 , -5)  ) 


Composition  Drawing  Expression 


Durer 

8 

10 

8 

Holbein 

9 

10 

13 

Pourbus 

4 

15 

6 

Van  Leyden 

8 

6 

4 

the  third  and  the  fifth  columns  (Colour  and  School)  are  not  shown. 

The  operator  0/0in0/0  allows  for  more  complex  searches.  For  instance, 

subset  (painters ,  Drawing  70in°/0  c(6,7,8,9)  &  Composition==10) 
returns  the  following  output: 


Composition 

Drawing  Colour  Expre 

ssion 

School 

Da  Udine 

10 

8 

16 

3 

A 

J.  Jordaens 

10 

8 

16 

6 

G 

Bourdon 

10 

8 

8 

4 

H 

i.e.  those  painters  with  a  drawing  score  between  6  and  9  (=  any  number  which  matches 
6,  or  7,  or  8,  or  9). 

Stratifying  a  Data  Frame  and  Applying  Commands  to  a  List 

Sometimes  it  is  of  interest  to  apply  statistical  commands  (such  as  summary)  to  several 
subgroups.  If  this  is  the  case,  the  data  is  partitioned  into  different  groups  using  split 
and  then  lapply  applies  a  function  to  each  of  these  groups.  The  command  split 
partitions  the  data  set  by  values  of  a  specific  variable.  For  example,  we  first  stratify 
the  painters  data  with  respect  to  the  painter’s  school: 

splitted  <-  split (painters ,  painters$School) 
splitted 


$A 


Composition 

Drawing  Colour  Expre 

ssion 

School 

Da  Udine 

10 

8 

16 

3 

A 

Da  Vinci 

15 

16 

4 

14 

A 

Del  Piombo 

8 

13 

16 

7 

A 

Del  Sarto 

12 

16 

9 

8 

A 

Fr.  Penni 

0 

15 

8 

0 

A 

Guilio  Romano 

15 

16 

4 

14 

A 

Appendix  A:  Introduction  to  R 


313 


Michelangelo 

8 

17 

4 

8 

A 

Perino  del  Vaga 

15 

16 

7 

6 

A 

Perugino 

4 

12 

10 

4 

A 

Raphael 

17 

18 

12 

18 

A 

$B 


Composition 

Drawing  Colour  Expre 

ssion 

School 

F .  Zucarro 

10 

13 

8 

8 

B 

Fr.  Salviata 

13 

15 

8 

8 

B 

Parmigiano 

10 

15 

6 

6 

B 

Primaticcio 

15 

14 

7 

10 

B 

T .  Zucarro 

13 

14 

10 

9 

B 

Volterra 

12 

15 

5 

8 

B 

$C 


Note,  that  split  ted  is  now  a  list, 

is . list (splitted) 

returns 

[1]  TRUE 

while  the  objects  splitted$A  to  splitted$H  are  data  frames. 


is . data. frame (splitted$A) 


returns 

[1]  TRUE 

Secondly,  as  indicated  above,  the  command  lapply  allows  us  to  apply  a  function  to 
a  list.  For  instance, 

lapply (splitted,  summary) 

applies  the  summary  function  to  all  data  frames  in  the  list  splitted  (output  not 
shown).  See  also  ? apply,  Tsapply,  Ttapply,  and  Tmapply  for  similar  opera¬ 
tions. 
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Sorting,  Ranking,  Finding  Duplicates,  and  Unique  Values 

•  Sorting  a  vector: 

x  <-  c(  1,3,  2,  5) 
sort (x) 

sort(x,  decreasing=TRUE) 

returns  the  ordered  values  in  decreasing  order  as 
[1]  5  3  2  1 

See  also  the  command  order  for  showing  the  order  of  vector  elements. 

•  Calculating  ranks; 

x  <-  c (  10,30,  20,  50,  20) 
rank(x) 

returns  the  following  output: 

[1]  1.0  4.0  2.5  5.0  2.5 

•  Finding  duplicate  values: 

x  <-  c (  1,3,  2,  5,  2) 
duplicated(x) 

indicates  which  values  occur  more  than  once: 

[1]  FALSE  FALSE  FALSE  FALSE  TRUE 

•  Removing  duplicates: 

x  <-  c (  1,3,  2,  5,  2) 
unique (x) 


shows  the  output  as 

[1]  13  2  5 


This  means  unique  finds  out  how  many  different  values  a  vector  has. 


Appendix  A:  Introduction  to  R 


315 


Categorizing  Numeric  Variables 

Continuous  variables  (vectors)  can  be  categorized  using  the  cut  command. 

x  <-  c(1.3,  1.5,  2.5,  3.8,  4.1,  5.9,  7.1,  8.4,  9.0) 

xdiscrete  <-  cut(x,  breaks=c (-Inf ,  2,  5,  8,  Inf)  ) 

is .factor (xdiscrete) 

xdiscrete 

table (xdiscrete) 

returns 

[1]  TRUE 

[1]  (-Inf, 2]  (-Inf, 2]  (2,5]  (2,5]  (2,5]  (5,8]  (5,8]  (8, Inf] 

[9]  (8, Inf] 

Levels:  (-Inf, 2]  (2,5]  (5,8]  (8, Inf] 

(-Inf, 2]  (2,5]  (5,8]  (8, Inf] 

2  3  2  2 

Random  Variables 

•  R  has  built-in  functions  for  several  probability  density /mass  functions  (PMF/PDF), 
(probability)  distribution  function  (i.e.  the  CDF),  quantile  functions  and  for  gen¬ 
erating  random  numbers. 

•  The  function  names  use  the  following  scheme: 


First  letter  Function 

Further  letters 

d 

density 

distribution 

name 

P 

probability 

distribution 

name 

q 

quantiles 

distribution 

name 

r 

random  number  distribution 

name 

•  Examples: 

dnorm(x=0) 

[1]  0.3989423 

returns  the  value  of  the  density  function  (i.e.  P(X  =  v))  of  a  N(0,  1) -distribution 
at  v  =  0,  which  is  1 

pnorm(q=0) 
pnorm(q=l . 96) 

[1]  0.5 

[1]  0.9750021 
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returns  the  value  of  the  CDF  of  a  N(0,  1 ) -distribution  at  q,  i.e.  <P(q)  =  P(X  <  q). 

qnorm(p=0 . 95) 

returns  the  value 
[1]  1.644854 

which  is  the  95  %  quantile  of  a  N(0,  1) -distribution. 

X  <-  rnorm(n=4) 

X 

returns  a  vector  of  four  normal  random  numbers  of  a  N(0,  1) -distribution: 

[1]  -0.90826678  -0.09089654  -0.47679821  1.22137230 

Note  that  a  repeated  application  of  this  function  leads  to  different  random  numbers. 
To  get  a  reproducible  sequence  of  random  numbers,  a  seed  value  needs  to  be  set: 


set . seed(89234512) 

X  <-  rnorm(n=4) 

X 

If  all  three  commands  are  executed,  then  the  sequence  is  (using  the  standard  random 
generator) 

[1]  -1.07628865  0.37797715  0.04925738  -0.22137107 

•  The  following  functions  for  distributions  can  be  used: 

Model  distributions 


Function 

Distribution 

beta 

Beta 

binom 

Binomial 

cauchy 

Cauchy 

exp 

Exponential 

gamma 

Gamma 

geom 

Geometric 

hyper 

Hypergeometric 

lnorm 

Log-normal 

norm 

Normal 

pois 

Poisson 

unit 

Uniform 

mvnorm 

Multivariate  normal  (in  package  mvtnorm) 
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Test  distributions 


Function 

Distribution 

chisq 

x2 

f 

F 

signrank  Wilcoxon  signed  rank  (1  sample) 

t 

t 

wilcox 

Wilcoxon  rank  sum  (2  samples) 

•  For  convenience,  we  list  a  few  important  PDF  and  CDF  values  in  Sect.  C. 


Key  Points  and  Further  Issues 


Note: 


/  R  uses  the  following  data  structures:  vectors,  matrices,  arrays,  lists,  and 
data  frames. 

/  Entries  of  matrices  and  data  frames  can  be  accessed  using  squared  brack¬ 
ets.  For  example,  data  [1 : 5 , 2]  refers  to  the  first  five  observations  of 
the  second  column  (variable)  of  the  data.  Variables  of  data  frames  can 
also  be  accessed  via  the  $  sign,  e.g.  via  data$variable. 

/  If  operations  such  as  statistical  analyses  have  to  be  repeated  on  several 
subgroups,  using  split  together  with  lapply  is  a  viable  option.  Alter¬ 
natively,  loops  (e.g.  for)  together  with  conditions  (such  as  if)  can  be 
used. 

/  R  contains  the  most  relevant  statistical  functions  needed 
for  descriptive  and  inductive  analyses  (as  shown  through¬ 
out  the  book).  User- written  packages  can  be  installed  using 
install . packages ( f c  package_name 3  . 

/  Readers  who  are  interested  in  learning  more  about  R  are  referred  to 
Albert  and  Rizzo  (2012),  Crawley  (2013),  Dalgaard  (2008),  Ligges 
(2008),  and  Everitt  and  Hothorn  (2011). 
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Data  Sets 

From  the  data  used  in  this  book,  we  publish  some  relevant  data  sets,  along  with 
solutions  of  the  7?-exercises,  on  https://chris.userweb.mwn.de/book/.  The  important 
data  sets  are  explained  in  the  following  paragraphs. 


Pizza  Delivery  Data 

The  pizza  delivery  data  (pizza_delivery .  csv,  see  also  Table  A.l)  is  a  simulated 
data  set.  The  data  refers  to  an  Italian  restaurant  which  offers  home  delivery  of  pizza. 
It  contains  the  orders  received  during  a  period  of  one  month:  May  2014.  There  are 
three  branches  of  the  restaurant.  The  pizza  delivery  is  centrally  managed:  an  operator 
receives  a  phone  call  and  forwards  the  order  to  the  branch  which  is  nearest  to  the 
customer’s  address.  One  of  the  five  drivers  (two  of  whom  only  work  part  time  at  the 
weekend)  delivers  the  order.  The  data  set  captures  the  number  of  pizzas  ordered  as 
well  as  the  final  bill  (in  €)  which  may  also  include  drinks,  salads,  and  pasta  dishes. 
The  owner  of  the  business  observed  an  increased  number  of  complaints,  mostly 
because  pizzas  arrive  too  late  and  too  cold.  To  improve  the  service  quality  of  his 
business,  the  owner  wants  to  measure  (i)  the  time  from  call  to  delivery  and  (ii)  the 
pizza  temperature  at  arrival  (which  can  be  done  with  a  special  device).  Ideally,  a  pizza 
arrives  within  30  min  of  the  call;  if  it  takes  longer  than  40  min,  then  the  customers 
are  promised  a  free  bottle  of  wine  (which  is  not  always  handed  out  though).  The 
temperature  of  the  pizza  should  be  above  65  °C  at  the  time  of  delivery.  The  analysis  of 
the  data  aims  to  determine  the  factors  which  influence  delivery  time  and  temperature 
of  the  pizzas. 


Table  A.l  First  few  rows  of  the  pizza  delivery  data 


day 

date 

time 

operator 

branch 

driver 

temperature 

1  Thursday 

1 

May  2014 

35.1 

Laura 

East 

Bruno 

68.3 

2  Thursday 

1 

May  2014 

25.2 

Melissa 

East 

Salvatore 

71.0 

3  Thursday 

1 

May  2014 

45.6 

Melissa 

West 

Salvatore 

53.4 

4  Thursday 

1 

May  2014 

29.4 

Melissa 

East 

Salvatore 

70.3 

5  Thursday 

1 

May  2014 

30.0 

Melissa 

West 

Salvatore 

71.5 

6  Thursday 

1 

May  2014 

40.3 

Melissa 

Centre 

Bruno 

60.8 

bill 

1  58.4 

2  26.4 

3  58.1 

4  35.2 

5  38.4 

6  61.8 


pizzas  free_wine  got_wine  discount_customer 


4  0  0 

2  0  0 

3  10 

3  0  0 

2  0  0 

4  11 


1 

0 

0 

0 

0 

0 
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Table  A.2  First  few  rows  of  the  decathlon  data  from  the  2004  Olympic  Games  in  Athens  data 


100m  Long. jump  Shot. put  High. jump  400m 


Roman  Sebrle 

10.85 

7 

Bryan  Clay 

10.44 

7 

Dmitriy  Karpov 

10.50 

7 

Dean  Macey 

10.89 

7 

Chiel  Warners 

10.62 

7 

Attila  Zsivoczky 

10.91 

7 

110m. 

. hurdle 

Discus 

Roman  Sebrle 

14. 

.05 

48. 

.72 

Bryan  Clay 

14. 

.13 

50. 

.11 

Dmitriy  Karpov 

13. 

.97 

51. 

.65 

Dean  Macey 

14. 

.56 

48. 

.34 

Chiel  Warners 

14. 

.01 

43. 

.73 

Attila  Zsivoczky 

14. 

.95 

45. 

.62 

84 

16.36 

2.12 

48.36 

96 

15.23 

2.06 

49.19 

81 

15.93 

2.09 

46.81 

47 

15.73 

2.15 

48.97 

74 

14.48 

1.97 

47.97 

14 

15.31 

2.12 

49.40 

Pole . vault 

Javelin 

1500m 

5.0 

70.52 

280.01 

4.9 

69.71 

282.00 

4.6 

55.54 

278.11 

4.4 

58.46 

265.42 

4.9 

55.39 

278.05 

4.7 

63.45 

269.54 

Decathlon  Data 

This  data  (decathlon .  csv,  see  also  Table  A.2)  describes  the  results  of  the  decathlon 
competition  during  the  2004  Olympic  Games  in  Athens.  The  performance  of  all  30 
athletes  in  the  100  m  race  (in  seconds),  long  jump  (in  metres),  shot-put  (in  metres), 
high  jump  (in  metres),  400  m  race  (in  seconds),  110  m  hurdles  race  (in  seconds), 
discus  competition  (in  metres),  pole  vault  (in  metres),  javelin  competition  (in  metres), 
and  1500  m  race  (in  seconds)  are  recorded  in  the  data  set. 


Theatre  Data 

This  data  (theatre .  csv,  see  also  Table  A. 3)  summarizes  a  survey  conducted  on 
699  participants  in  a  big  Swiss  city.  The  survey  participants  are  all  frequent  visitors 
to  a  local  theatre  and  were  asked  about  their  age,  sex  (gender,  female  =1),  annual 
income  (in  1000  SFR),  general  expenditure  on  cultural  activities  (“Culture”,  in  SFR 
per  month),  expenditure  on  theatre  visits  (in  SFR  per  month),  and  their  estimated 
expenditure  on  theatre  visits  in  the  year  before  the  survey  was  done  (in  SFR  per 
month). 


Table  A.3  First  few  rows  of  the  theatre  data 


Age 

Sex 

Income 

Culture 

Theatre 

Theatre_ly 

1 

31 

1 

90.5 

181 

104 

150 

2 

54 

0 

73.0 

234 

116 

140 

3 

56 

1 

74.3 

289 

276 

125 

4 

36 

1 

73.6 

185 

75 

130 

5 

24 

1 

109.0 

191 

172 

140 

6 

25 

0 

93.1 

273 

168 

130 

Appendix:  Solutions  to  Exercises 


Solutions  to  Chapter  1 

Solution  to  Exercise  1.1 

(a)  The  population  consists  of  all  employees  of  the  airline.  This  may  include  admin¬ 
istration  staff,  pilots,  stewards,  cleaning  personnel,  and  others.  Each  single 
employee  relates  to  an  observation  in  the  survey. 

(b)  The  population  comprises  all  students  who  take  part  in  the  examination.  Each 
student  represents  an  observation. 

(c)  All  people  suffering  high  blood  pressure  in  the  study  area  (city,  province,  country, 
. . .),  are  the  population  of  interest.  Each  of  these  persons  is  an  observation. 

Solution  to  Exercise  1.2  The  population  in  this  study  refers  to  all  leopards  in  the 
park.  Only  a  few  of  the  leopards  are  equipped  with  the  GPS  devices.  This  is  the 
sample  on  which  the  study  is  conducted  in.  Each  leopard  refers  to  an  observation. 
The  measurements  are  taken  for  each  leopard  in  the  sample.  The  GPS  coordinates 
allow  to  determine  the  position  during  the  entire  day.  Important  variables  to  capture 
would  therefore  be  X\  =  “latitude”,  X^—  “longitude”,  and  X3  =  “time”.  Each 
variable  would  take  on  certain  values  for  each  observation;  for  example,  the  first 
leopard  may  have  been  observed  at  latitude  32°  at  a  certain  time  point,  and  thus 
x\\  =  32°. 
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Solution  to  Exercise  1.3 


Qualitative: 

Quantitative  and  discrete: 
Quantitative  and  continuous: 


Preferred  political  party,  eye  colour,  gender, 
blood  type,  subject  line  of  an  email. 

Shoe  size,  customer  satisfaction  on  a  scale  from 
1  to  10,  number  of  goals  in  a  hockey  match. 
Time  to  travel  to  work,  price  of  a  canteen  meal, 
wavelength  of  light,  delivery  time  of  a  parcel, 
height  of  a  child. 


Solution  to  Exercise  1.4 

(a)  The  choice  of  a  political  party  is  measured  on  a  nominal  scale.  The  names  of  the 
parties  do  not  have  a  natural  order. 

(b)  Typically  the  level  of  a  computer  game  is  measured  on  an  ordinal  scale:  for 
example,  level  10  may  be  more  difficult  than  level  5,  but  this  does  not  imply  that 
level  10  is  twice  as  difficult  as  level  5,  or  that  the  difference  in  difficulty  between 
levels  2  and  3  is  the  same  as  the  difference  between  levels  10  and  11. 

(c)  The  production  time  of  a  car  is  measured  on  a  continuous  scale  (ratio  scale).  In 
practice,  it  may  be  measured  in  days  from  the  start  of  the  production. 

(d)  This  variable  is  measured  on  a  continuous  scale  (ratio  scale).  Typically,  the  age 
is  captured  in  years  starting  from  the  day  of  birth. 

(e)  Calender  year  is  a  continuous  variable  which  is  measured  on  an  interval  scale. 
Note  that  the  year  which  we  define  as  “zero”  is  arbitrary,  and  it  varies  from 
culture  to  culture.  Because  the  year  zero  is  arbitrary,  and  we  also  have  dates 
before  this  year,  the  calender  year  is  measured  on  an  interval  scale. 

(f)  The  scale  is  continuous  (ratio  scale). 

(g)  The  scale  of  ID  numbers  is  nominal.  The  ID  number  may  indeed  consist  of 
numbers;  however,  “112233”  does  not  refer  to  something  half  as  much/good  as 
“224466”.  The  number  is  descriptive. 

(h)  The  final  rank  is  measured  on  an  ordinal  scale.  The  ranks  can  be  clearly  ordered, 
and  the  participants  can  be  ranked  by  using  their  final  results.  However  the  first 
winner  may  not  have  “double”  the  beauty  of  the  second  winner,  it  is  merely  a 
ranking. 

(i)  The  intelligence  quotient  is  a  variable  on  a  continuous  scale.  It  is  constructed 
in  such  a  way  that  differences  are  interpretative — i.e.  being  10  points  above  or 
10  points  below  the  average  score  of  100  points  means  the  same  deviation  from 
the  average.  However,  ratios  cannot  be  interpreted,  so  the  intelligence  quotient 
is  measured  on  an  interval  scale. 
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Solution  to  Exercise  1.5 


(a)  The  data  is  provided  in  .csv  format.  We  thus  read  it  in  with  the  read. csv() 
command  (after  we  have  set  a  working  directory  with  setwdO): 

setwd( ' C : /directory ' ) 

pizza  <-  read . csv ( 1 pizza_de livery . csv 1 ) 

(b)  The  data  can  be  viewed  by  means  of  the  f  ix  ()  or  View  ()  command  or  simply 
being  printed: 


fix (pizza) 

R 

pizza 

(c)  We  can  access  the  data,  as  for  any  matrix,  by  using  squared  brackets  [,  ],  see  also 
Appendix  A.l.  The  first  entry  in  the  brackets  refers  to  the  row  and  the  second 
entry  to  the  columns.  Each  entry  either  is  empty  (referring  to  every  row/column) 
or  consists  of  a  vector  or  sequence  describing  the  columns/rows  we  want  to  select. 
This  means  that  we  obtain  the  first  5  rows  and  variables  via  pizza  [1 : 5 , 1 : 5] . 
If  we  give  the  new  data  the  name  “pizza2”  we  simply  need  to  type: 

n 

pizza2  <-  pizza [1 : 5 , 1 : 5] 
pizza2 

We  can  save  this  new  data  either  as  a  .dat  file  (with  write  .table  ()),  or  as  a 
.csv  file  (with  write .  csv  ()),  or  directly  as  an  R  data  file  (with  save  ( ) )  which 
gives  us  access  to  our  entire  R  session. 

write . csv(pizza2 ,f ile= ' pizza2 . csv ' ) 
write. table (pizza2 ,f ile= 1 pizza2.dat ' ) 
save (pizza2 , f ile= ' pizza2 . Rdata ' ) 

(d)  We  can  access  any  variable  by  means  of  the  $  sign.  If  we  type  pizza$new  we 
create  a  new  variable  in  the  pizza  data  set  called  “new”.  Therefore,  a  simple  way 
to  add  a  variable  to  the  data  set  is  as  follows: 


pizza$NewTemperature  <-  32+1 . 8*pizza$temperature 


attach(pizza) 

R 

NewTemperature 
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>  str (pizza) 

' data . frame ' :  1266 

$  day 
$  dare 
$  rime 
$  operator 
S  branch 
S  driver 
$  temperature 
$  bill 
$  pizzas 
$  free_wine 
$  got_wine 
$  discount_customer 
$  NewTemperature 


obs.  of  13  variables: 

Factor  w/  7  levels  "Friday”, "Monday", :  5555. 
Factor  w/  31  levels  "Ol-May-14", "02-May-14", . . :  1 
num  35.1  25.2  45.6  29.4  30  ... 

Factor  w/  2  levels  "Laura", "Melissa" :  122222 
Factor  w/  3  levels  "Centre", "East", .. :  22323: 
Factor  w/  5  levels  "3runo", "Domenico", .. :  1555 


num 

num 

int 

int 

int 

int 

num 


68.3  71  53.4  70.3  71.5 

58.4  26.4  58.1  35.2  38 
4  3  2  2  5 
1  1  0  0  0, 

1  1  0  0  0, 

0  0  0  0  0, 


4  2  3  3  2 
0  0  10  0 
0  0  0  0  0 
1  0  0  0  0 


155  160  128  159  161 


4  61.8  57.9  35.8  36.6 


Fig.  B.1  Applying  str  ()  to  the  pizza  data 


(f)  We  can  apply  all  these  commands  onto  the  object  “pizza”.  The  command 
str  (pizza)  gives  us  an  overview  of  the  data  set,  see  also  Fig.  B.l.  The  output 
shows  that  the  data  set  contains  1266  observations  (deliveries)  and  13  variables. 
We  can  also  see  which  of  these  variables  are  factors  (categorical  with  defined 
categories)  and  which  are  numerical.  We  also  see  the  first  actual  numbers  for 
the  each  variable  and  the  coding  scheme  used  for  the  categories  of  the  factor 
variables.  The  command  dim  summarizes  the  dimension  of  the  data,  i.e.  the 
number  of  rows  and  columns  of  the  data  matrix.  Colnames  gives  us  the  names 
of  the  variables  from  the  data  set,  and  so  does  names.  The  commands  nrow  and 
ncol  give  us  the  number  of  rows  and  columns,  respectively.  Applying  head  and 
tail  to  the  data  prints  the  first  and  last  rows  of  the  data,  respectively. 


Solution  to  Exercise  1.6 

(a)  The  appropriate  study  design  is  a  survey.  The  information  would  be  obtained  via 
a  questionnaire  given  to  a  sample  of  parents.  It  is  not  a  controlled  experiment 
because  we  do  not  manipulate  one  particular  variable,  while  controlling  others; 
we  rather  collect  data  on  all  variables  of  interest. 

(b)  There  are  different  options  to  ask  for  parents’  attitudes:  of  course  one  could 
simply  ask  “what  do  you  think  of  immunization?”;  however,  capturing  long 
answers  in  a  variable  “attitude”  may  make  it  difficult  to  summarize  and  distil  the 
information  obtained.  A  common  way  to  deal  with  such  variables  is  to  translate 
a  concept  into  a  score:  for  example,  one  could  ask  5  “yes/no”-type  questions 
(instead  of  one  general  question)  which  relate  to  attitudes  towards  immunization, 
such  as  “do  you  think  immunization  may  be  harmful  for  your  child?”  or  “do  you 
agree  that  it  is  a  priority  to  immunize  infants  in  their  first  year  of  life?”  The 
number  of  answers  that  show  a  positive  attitude  towards  immunization  can  be 
summed  up.  If  there  are  5  questions,  there  are  up  to  5  points  “to  earn”.  Thus, 
each  parent  may  be  asked  5  questions  and  his/her  attitude  can  be  summarized 
on  a  scale  ranging  from  0  to  5,  depending  on  the  answers  given. 
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(c)  The  following  variables  are  needed: 

•  Attitude:  possibly  several  variables  are  needed  to  capture  parents’  information 
in  a  score,  see  (b)  for  details.  The  scale  is  ordinal  because  a  higher  score  relates 
to  a  more  positive  attitude  towards  immunization,  but  the  differences  between 
different  score  levels  cannot  be  interpreted  meaningfully. 

•  Immunized:  a  binary  (“yes-no”  type)  variable  capturing  whether  the  parent 
agrees  to  immunization  against  chickenpox  for  their  youngest  child  or  not. 
This  is  a  nominal  variable. 

•  Gender:  to  compare  “Immunized”  for  male  and  female  parents.  This  is  a 
nominal  variable. 

•  Age:  to  compare  the  age  distribution  in  the  group  of  parents  who  would  immu¬ 
nize  their  child  with  the  age  distribution  in  the  group  who  would  not  immunize 
their  child.  Age  is  measured  on  a  continuous  scale. 

(d)  A  data  set  might  look  as  follows: 

Parent  Ai  ...  A5  Attitude  Immunization  Gender  Age 

1  yes  . . .  yes  3 

2  no  ...  yes  2 

where  A\,  . . . ,  A 5  refer  to  variables  capturing  attitudes  towards  immunization 
and  “Attitude”  is  the  score  variable  summarizing  these  questions.  The  questions 
may  be  written  down  as  follows: 

•  What  is  the  average  attitude  score  towards  immunization  among  parents  and 
how  much  does  it  vary? 

•  What  percentage  of  parents  answer  “yes”  when  asked  whether  they  would 
immunize  their  youngest  child  against  chickenpox? 

•  What  is  the  difference  in  the  proportion  calculated  in  (b)  when  stratified  by 
gender? 

•  What  is  the  average  age  of  those  parents  who  would  immunize  their  child 
compared  with  the  average  age  of  those  parents  who  would  not  immunize 
their  child? 


yes 

no 


male  35 
female  26 


Chapter  2 


Solution  to  Exercise  2. 1 

(a)  The  table  shows  the  relative  frequencies  of  each  party  and  not  the  absolute 
frequencies.  We  can  thus  draw  a  bar  chart  where  the  relative  frequencies  of 
votes  are  plotted  on  the  y-axis  and  different  parties  are  plotted  on  the  v-axis.  In 
R ,  we  can  first  type  in  the  data  by  defining  two  vectors  and  then  use  the  “barplot” 
command  to  draw  the  bar  chart  (Fig.  B.2a).  Typing  “?barplot”  and  “?par”  shows 
the  graphical  options  to  improve  the  presentation  and  look  of  the  graph: 


326 


Appendix  B:  Solutions  to  Exercises 


N- 


Parties 

(a)  Results  of  the  2014  elections 


o 

o 


LD 

O  - 

o 


o 

o 

I  ANC  DA  EFF  IFP  COPE  Others 

Parties 


(b)  Difference  in  results  between  the 
2014  and  2009  elections 


Fig.  B.2  Bar  charts  for  national  elections  in  South  Africa 


results2014  <-  c (0 . 6215 , 0 . 2223 , 0 . 0635 , 0 . 0240 , 0 . 0067 , 0 . 0620) 
barplot (results2014) 

barplot (results2014 , names . arg=c ( 1 ANC 1 , ' DA ' , ' EFF ' , ' IFP ' , 'COPE' , 

1  Others  1 ) ,  col=gray . colors (6) ,ylim=c (0,0.7) ,xlab= ' Parties  1 ,ylab  = 
' Votes (%) 1 ) 


(b)  There  are  several  options  to  compare  the  results.  Of  course,  one  can  simply  plot 
the  two  bar  charts  with  each  bar  chart  representing  one  election.  It  would  be 
important  for  this  solution  to  ensure  that  the  y-axes  are  identical  in  both  the 
plots.  However,  if  we  want  to  compare  the  election  results  in  one  graph  then  we 
can  plot  the  difference  between  the  two  elections,  i.e.  the  win/loss  per  party.  The 
bar  chart  for  the  new  variable  “difference  in  proportion  of  votes  between  the  two 
elections”  is  shown  in  Fig.  B.2  and  is  obtained  as  follows: 

results2009  <-  c (0 . 6590 , 0 . 1666 , 0 , 0 . 0455 , 0 . 0742 , 0 . 0547) 
difference  <-  results2014-results2009 
barplot (difference) 


Remark  Another  solution  would  be  to  create  subcategories  on  the  v-axis:  for  exam¬ 
ple,  there  would  be  categories  such  as  “ANC  2009  results”  and  “ANC  2014  results”, 
followed  by  “DA  2009  results”  and  “DA  2014  results”,  and  so  on. 


Solution  to  Exercise  2.2 

(a)  The  scale  of  X  is  continuous.  However,  please  note  that  the  number  of  values 
X  can  practically  take  is  limited  (90  min  plus  extra  time,  recorded  in  1  min 
intervals). 
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Table  B.1  Frequency  table  and  other  information  for  the  variable  “time  until  first  goal” 


j 

lej-uej) 

nj 

fi 

dj 

hj 

F(x) 

1 

2 

3 

4 

5 

6 

7 

[0,  15) 

[15,30) 

[30,  45) 

[45,  60) 

[60, 75) 

[75,  90) 

[90,  96) 

19 

17 

6 

5 

4 

2 

2 

19 

55 

17 

55 

6 

55 

5 

55 

4 

55 

2 

55 

2 

55 

15 

15 

15 

15 

15 

15 

6 

19 

825 

17 

825 

6 

825 

5 

825 

4 

825 

2 

825 

2 

825 

19 

55 

36 

55 

42 

55 

47 

55 

51 

55 

53 

55 

1 

Total 

56 

1 

(b)  It  is  straightforward  to  obtain  the  frequency  table,  as  well  as  the  other  information 
needed  to  obtain  the  histogram  and  the  ECDF,  see  Table  B.l. 

(c)  We  need  to  obtain  the  heights  for  each  category  to  obtain  the  histogram  using 
hj  =  fj / dj,  see  Table  B.l. 


(d)  We  obtain  the  histogram  and  kernel  density  plot  in  R  (Fig.  B.3a)  using  the 
commands 

goals  <-  c(6,24,91, . . . ,7) 

hist (goals ,  breaks=c (0 , 15,30,45,60,75,90, 96) ) 
plot (density (goals ,  adjust=l , kernel= ' gaussian 1 ) ) 


LO 

CM 


Time  (in  min) 


(a)  Histogram 


(b)  Kernel  density  plot 


Fig.  B.3  Distribution  of  time  to  goal 
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Fig.  B.4  Empirical  cumulative  distribution  function  for  the  variable  “time  until  first  goal” 


(e)  The  ECDF  values  for  F(x)  are  calculated  using  the  relative  frequencies  fix), 
see  Table  B.l. 

(f)  (i)  We  can  easily  plot  the  ECDF  (Fig.  B.4a)  for  the  original  data  using  the  R 

command 

plot . ecdf (goals) 


(ii)  Generating  the  ECDF  for  the  grouped  data  requires  more  effort  and  is  not 
necessarily  intuitive:  first  we  categorize  the  continuous  variable  using  the 
function  cut.  We  use  the  label  option  to  indicate  that  the  name  of  each 
category  corresponds  to  the  upper  limit  of  the  respective  interval.  This  new 
variable  is  a  “factor”  variable  and  the  plot .  ecdf  function  does  not  work 
with  this  type  of  variable.  We  need  to  first  change  the  “factor”  variable  into  a 
“character”  variable  with  strings  corresponding  to  the  labels  and  coerce  this 
into  numeric  values.  Then  we  use  plot .  ecdf,  see  also  Fig.  B.4b.  Alterna¬ 
tively,  we  can  directly  replace  the  raw  values  with  numbers  corresponding 
to  the  upper  interval  limits. 

goals_cat  <-  cut(goals,  breaks=c (0 , 15 , 30 ,45 , 60 , 75 , 90 , 96) , 

labels=c (15,30,45,60,75,90, 96) ) 

plot . ecdf (as .numeric (as . character (goals_cat) ) 


(g)  To  solve  the  exercises,  we  simply  use  Formula  (2.11)  and  Rules  (2.3ff.) 

(i)  H(X  <  45)  =  F(45)  =  §  %  0.76. 

(ii)  H(X  >  80)  =  1  -  F(80)  =  1  -  +  ^(80  -  75))  %  0.085. 
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(iii)  H{ 20  <  X  <  65)  =  F( 65)  -  F{ 20)  =  53  +  Tp  •  (65  -  60)  -  (±§+ 
•  (20  -  15))  %  0.43. 


(h)  We  know  from  (2.11)  that 

F(xp)  =  p=  F  (e  j-i)  +  h  j(Xp  —  e 'j-i) 
with  hj  =  fj /dj  which  relates  to 


_  p  -  F(ej- 1) 

Xp  —  e j — i 

hi 

We  are  interested  in  vo.8  because  80  %  of  the  matches  have  seen  a  goal  at  this 
time  point: 


0.8  -  f 

x0.8  =  45  + - =  50.4. 

168 


We  conclude  that  80  %  of  the  “first  goals”  happened  up  to  50.4  min. 


Solution  to  Exercise  2.3 


(a)  We  obtain  the  relative  frequencies  for  the  first  and  fourth  intervals  as  0.125  (hj  • 
dj  =0.125-1).  Accordingly,  for  the  other  two  intervals,  we  obtain  frequencies 
of  fj  =  3-0.125  =  0.375. 

(b)  We  obtain  the  absolute  frequencies  for  the  first  and  fourth  intervals  as  250  (2000  • 
0.125).  For  the  other  intervals,  we  obtain  750  (2000  •  0.375). 


Solution  to  Exercise  2.4 


(a)  The  absolute  frequencies  nj  are  evident  from  the  following  table: 


j 

ej- 1 

eJ 

Fiej) 

fj 

rij(=  fjn) 

dj 

aj 

1 

8 

14 

0.25 

0.25 

0.25  •  500  =  125 

6 

11 

2 

14 

22 

0.40 

0.15 

75 

8 

18 

3 

22 

34 

0.75 

0.35 

175 

12 

28 

4 

34 

50 

0.97 

0.22 

110 

16 

42 

5 

50 

82 

1.00 

0.03 

15 

32 

66 

(b)  We  obtain  F(X  >  34)  =  1  -  F( 34)  =  1  -  0.75  =  0.25. 
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Table  B.2  Information  needed  to  calculate  the  ECDF 


Score 

1 

2 

3 

4 

5 

6 

7 

8 

9 

10 

Results 

1 

3 

8 

8 

27 

30 

11 

6 

4 

2 

fj 

1 

3 

8 

8 

27 

30 

li 

6 

4 

2 

too 

100 

100 

100 

100 

100 

100 

100 

100 

100 

Fj 

1 

4 

12 

20 

47 

77 

88 

94 

98 

1 

too 

100 

100 

100 

100 

100 

100 

100 

100 

Fig.  B.5  Empirical 
cumulative  distribution 
function  for  the  variable 
“score” 

s< 


Solution  to  Exercise  2.5 


(a)  The  data  needed  to  calculate  and  draw  the  ECDF  is  summarized  in  Table  B.2; 
the  ECDF  is  plotted  in  Fig.  B.5. 

(b)  It  follows  from  Table  B.2  that  F( 3)  =  12  %  and  F( 9)  =  98  %. 

(c)  The  grey  solid  line  in  Fig.  B.5  summarizes  the  ECDF  for  the  grouped  data. 
It  goes  from  (0,  0)  to  (1,  1)  with  a  breakpoint  at  (5,  0.47)  since  F(5)  =  0.47 
summarizes  the  information  for  the  group  “disagree”.  Using  (2.1 1)  we  calculate: 


F(3)  =  F(ej-1)+F(x-ej_l) 

dj 

0.47 

=  F(0)  +  —  -  (3  —  0)  =  28.2  % 
F( 9)  =  F( 5)  +  ^  .  (9  -  5)  =  89.4%. 


(d)  The  results  of  (b)  and  (c)  are  very  different.  The  formula  applied  in  (c)  assumes 
that  the  values  in  each  category  are  uniformly  distributed,  i.e.  that  within  each 
category,  each  value  occurs  as  often  as  each  other  value.  However,  we  know  from 
(a)  that  this  is  not  true:  there  are  more  values  towards  the  central  score  numbers. 
The  assumption  used  in  (c)  is  therefore  inappropriate  as  also  highlighted  in 
Fig.  B.5. 
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Solution  to  Exercise  2.6  We  read  in  and  attach  the  data  as  follows: 


setwd( ' C : /directory 1 ) 

pizza  <-  read. csv ( 'pizza_delivery. csv' ) 
attach(pizza) 


(a)  We  need  the  options  ylim,  xlim,  ylab,  xlab,  col  to  adjust  the 
histogram  produced  with  hist().  We  then  add  a  dashed  (lty=2)  line 
(type=  ’  1  ’ ),  which  is  thick  (lwd=3),  from  (65, 0)  to  (65, 400)  using  the  lines  ( ) 
command.  See  also  Fig.  B.6a. 


hist (temperature , xlab= 1  Temperature  1 , xlim=c (40 , 90) , 
ylim=c (0, 400) ,col=' lightgrey' ,ylab=' Number  of  deliveries') 
lines (c (65 ,65) , c (0 , 400) , type= ' 1 ' , lty=2 , lwd=3) 


(b)  We  can  create  the  histogram  as  follows,  see  also  Fig.  B.6b: 


library (ggplot2) 

pi  <-  ggplot (data=pizza, aes (x=temperature) ) 

p2  <-  pi  +  geom_histogram(f ill= ' darkgrey ' , alpha=0 . 5 ,binwidth=2 . 5) 
+  scale_y_continuous ( 1  Number  of  deliveries') 
plot (p2) 


(c)  A  possible  solution  is  as  follows,  see  also  Fig.  B.6c: 


barplot (table (driver) , ylim=c (0 , 200) , col=gray . colors (7) , 
ylab= ' Number  of  deliveries',  xlab= ' Driver ' ,main= ' Title ') ) 


(d)  We  can  produce  the  graph  (Fig.  B.6d)  as  follows: 


P3  <-  qplot (driver , data=pizza, aes= ( ' bar ' ) , f ill=day) 
p4  <-  p3  +  scale_f ill_grey ()  +theme_bw() 
plot (p4) 


200- 
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Chapter  3 


Solution  to  Exercise  3. 1 


(a)  The  arithmetic  means  can  be  calculated  as  follows: 

10 

i  , 

Xd 

m 

i  =  1 

10 

i  , 

*A 


1  ^  1 

=  —  >  jc,  =  — (12.5  +  •  •  •  +  17.5)  =  17.32, 

10^  10 

i=\ 

i  10  i 

=  —  >  Xi  =  —(342  +  •  •  •  +  466)  =  612.4. 
10  ^  10 


i  =  1 

The  ordered  values  of  the  two  variables  are: 


i:  1 


7  8 


10 


D:  7.6  12.1  12.5  14.8  16.2  16.5  17.5  18.5  27.4  29.9 
A:  238  342  398  466  502  555  670  796  912  1245 


1  „ 

*0.5, D  =  “(*(5)  +X(6))  = 

1  „ 

*0.5  ,A  =  “(*(5)  +  X(6))  = 


1 

-(16.2+16.5)  =  16.35, 
2' 

1 

-(502  +  555)  =  528.5. 


(b)  We  have  Tier  =  10-0.25  =  2.5  which  is  not  an  integer.  We  can  therefore  calculate 
the  25  %  quantiles,  i.e.  the  first  quartiles,  as 

*0.25,  D  =  *(3)  =  12.5;  *0.25,  A  =  *(3)  =  398. 

Similarly,  na  =  10-0.75  =  1. 5  is  not  an  integer  and  thus 

*0.75,  D  =  *(8)  =  18.5;  *0.75,  A  =  *(8)  =  796. 

One  can  see  that  the  distributions  for  both  variables  are  not  symmetric.  For 
example,  when  looking  at  the  distance  hiked,  the  difference  between  the  median 
and  the  first  quartile  (16.35  —  12.5)  is  much  larger  than  the  difference  between 
the  median  and  the  third  quartile  (18.5  —  16.35);  this  indicates  a  distribution  that 
is  skewed  towards  the  left. 

(c)  We  can  calculate  the  interquartile  ranges  as  follows: 

dQA  =  796  -  398  =  398;  dQ,D  =  18.5  -  12.5  =  6. 

The  mean  absolute  deviations  are: 

1 

Dd(3 0.5)  =  =  Y^d7-6  -  16.35|  +  •  •  •  +  |29.9  -  16.35|)  =  4.68, 

1 

Da(* 0.5)  =  =  Y^O238  “  528-5l  +  •  •  •  +  |1245  -  528.5|)  =  223.2. 


Distance  (in  km) 
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The  variances  of  both  the  variables  are 

s2d  =  3- ([7.6  -  16.35]2  +  •  •  •  +  [29.9  -  16.35]2)  ss  41.5, 

s2A  =  -f  ([238  -  528. 5]2  +  •  •  •  +  [1245  -  528.5]2)  82,  314. 

The  standard  deviations  are  therefore  So  =  V41.5  and  Sa  =  V82,  314. 

(d)  To  answer  this  question,  we  can  use  the  rules  of  linear  transformations. 

y  (=4)  0  +  3.28i  =  3.28  •  528.5  =  1722.48, 
s2  (3=9)  b2S2x  =  3.282  •  272.2  «  2928. 

(e)  To  draw  the  box  plots,  we  can  use  the  results  from  (a),  (b),  and  (c)  in  this 
solution.  The  median,  first  quartile,  third  quartile,  and  the  interquartile  range 
have  already  been  calculated.  It  is  also  easy  to  determine  the  minimum  and 
maximum  values  from  the  table  of  ordered  values  in  (a).  What  is  missing  is  the 
knowledge  of  whether  to  treat  some  of  the  values  as  extreme  values  or  not.  For  the 
distance  hiked,  an  extreme  value  would  be  any  value  >  18.5  +  1 .5  x  6  =  27.5  or 
<  12.5  -  1.5  x  6  =  3.5.  It  follows  that  there  is  only  one  extreme  value:  29.9  km. 
For  the  maximum  altitude,  there  is  no  extreme  value  because  there  is  no  value 
>796  +  1.5  x  398  =  1292  or  <398  —  1.5  x  398  =  —199.  The  box  plots  are 
shown  in  Fig.  B.7a,  b. 


(a)  Box  plot  for  distance  hiked 


(b)  Box  plot  for  maximum  altitude 


Fig.B.7  Box  plots  for  Exercise  3.1 
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(f)  The  data  can  be  summarized  as  follows: 


Class  intervals 

(5;  15]  (15;  20]  (20;  30] 

ni 

4  4  2 

fi 

4/10  4/10  2/10 

Tfj 

4/10  8/10  1 

We  can  calculate  the  weighted  arithmetic  mean  by  using  the  relative  frequencies 
fj  and  the  middle  of  the  intervals  mj : 


^44  2 

=  2_,  fjmj  =  —  •  10  +  —  •  17.5  +  —  •  25  =  16. 


10 


10 


10 


To  estimate  the  weighted  median,  we  need  to  determine  the  class  for  which 

m  —  1  m 

22  fj  <  °-5  and  22  //  -  °-5 

j= i  ;= i 

holds.  This  is  clearly  the  case  for  the  second  class  K 2  =  (15;  20].  Thus 


0.5  -  z?=7  fj 


*0.5  —  em- 1  + 


0.5  -0.4 


L 


r~‘  "J  dm  =  15  +  •  5  =  16.25, 


m 


0.4 


(g)  If  the  raw  data  is  known,  then  the  variance  for  the  grouped  data  will  be  identical 
to  the  variance  calculated  in  (c).  For  educational  purposes,  we  show  the  identity 
here.  The  variance  for  grouped  data  can  be  calculated  as: 


s 


2 


1 

n 


k 


22  nj(*j 

;= 1 


between 


within 


Using  the  arithmetic  mean  v  =  17.32  as  well  as  the  means  in  each  class,  x\  = 
11.75,  X2  =  17.225,  and  V3  =  28.65,  we  can  calculate  the  variance  between  the 
classes: 

~sl  =  T([4  .  (H.75  _  17.32)2]  +  [4-  (17.225  -  17.32)2] 

+[2  •  (28.65  -  17.32)2])  =  38.08735. 


The  variances  within  each  class  are: 

=  I  [(7.6  -  11.75)2  +  •  •  •  +  (14.8  -  11.75)2]  =  6.8025, 

s2  =  2 [(16.2  -  17.225)2  +  •  •  •  +  (18.5  -  17.225)2]  =  0.956875, 

s]  =  I [(27.4  -  28. 65)2  +  (29.9  -  28.65)2]  =  1.5625. 
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We  thus  get 

?2  =  —  (4  •  6.8025  +  4  •  0.956875  +  2  •  1.5625)  =  3.41625. 

The  total  variance  is  therefore  s 2  =  si  +  =  3.41625  +  38.08735  ~  41.5. 

The  results  will  typically  differ  if  we  do  not  know  the  raw  data:  we  have  to 
replace  the  arithmetic  means  within  each  class,  xj ,  with  the  middle  of  each  class 
aj ,  i.e.  a\  =  10,  <22  =  17.5,  <23  =  25: 

S2b  =  T([4  .  (10  -  17.32)2]  +  [4  •  (17.5  -  17.32)2] 

+[2  •  (25  -  17.32)2])  =  33.2424. 

We  further  get 

~s\  =  1- [(7.6  -  10)2  +  •  •  •  +  (14.8  -  10)2]  =  9.865, 

s\  =  2 [(16.2  -  17.5)2  +  •  •  •  +  (18.5  -  17.5)2]  =  0.9225, 

si  =  -[(27.4  -  25)2  +  (29.9  -  25)2]  =  1.5625, 

2 

and 

9  1 

si  =  —(4  •  9.865  +  4  •  0.9225  +  2  •  14.885)  =  7.292. 

The  variance  is  s2  =  J2  +  =  7.292  +  33.2424  ~  40.5.  The  approximation 

is  therefore  good.  However,  please  note  that  the  between-class  variance  was 
estimated  too  low,  but  the  within-class  variance  was  estimated  too  high;  only 
the  combination  of  the  two  variance  components  led  to  reasonable  results  in  this 
example.  It  is  evident  that  the  approximation  in  the  third  class  was  not  ideal.  The 
middle  of  the  interval,  25,  was  not  a  good  proxy  for  the  true  mean  in  this  class, 
28.65. 

(h)  It  is  easy  to  calculate  the  mean  and  the  median: 

distance  <-  c (12 . 5 , 29 . 9 , . . . , 17 . 5) 
altitude  <-  c (342 , 1245 ,... ,466) 
mean (distance) 
mean (altitude) 
median (distance) 
median (altitude) 

We  can  use  the  quantile  function,  together  with  the  probs  option,  to  get  the 
quantiles: 
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quantile (distance ,probs=0 . 75) 
quantile (distance ,probs=0 . 25) 
quantile (altitude ,probs=0 . 75) 
quantile (altitude ,probs=0 . 25) 

However,  the  reader  will  see  that  the  results  differ  slightly  from  our  results 
obtained  in  (b).  As  noted  in  Example  3.1.5 ,  R  offers  nine  different  ways  to  obtain 
quantiles,  each  of  which  can  be  chosen  by  the  type  argument.  The  difference 
between  these  options  cannot  be  understood  easily  without  a  background  in 
probability  theory.  It  may,  however,  be  worth  highlighting  that  we  get  the  same 
results  as  in  (b)  if  we  choose  the  type=2  option  in  this  example.  The  interquartile 
ranges  can  be  calculated  by  means  of  the  difference  of  the  quantiles  obtained 
above.  To  determine  the  mean  absolute  deviation,  we  have  to  program  our  own 
function: 


amd  <-  function (mv) {1/length (mv) *sum(abs (mv -median (mv) )) } 
amd (distance) 
amd (altitude) 


We  can  calculate  the  variance  using  the  var  command.  However,  as  noted  in 
Example  3.2.4,  on  p.  52,  R  uses  1  /(n  —  1)  rather  than  l/n  when  calculating  the 
variance.  This  important  alternative  formula  for  variance  estimation  is  explained 
in  Chap.  9,  Theorem  9.2.1.  To  obtain  the  results  from  (c),  we  hence  need  to 
multiply  the  output  from  R  by  (n  —  l)/n: 

var (altitude) *9/10 
var (distance) *9/ 10 

The  box  plots  can  be  drawn  by  using  the  boxplot  command: 


boxplot (altitude) 

1  j 

w*  Jy* 

1 1  % 

boxplot (distance) 

The  weighted  mean  is  obtained  as  follows: 

weighted. mean(c( 10, 17. 5, 25) ,c (4/10, 4/10, 2/10)) 
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Solution  to  Exercise  3.2 


(a)  We  need  to  solve  the  equation  that  defines  the  arithmetic  mean: 

x  =  —  ^  Xi 

n  ^ 

i 

1 

-90  =  —(200  +  600  -  200  -  200  -  200  -  100  -  100  -400  +  0  +  7?) 
1 

-90  =  -(-400  +  7?) 

10 

=>  7?  =  -500. 


(b)  The  mode  is  xm  =  —200.  Using  na  =  2.5  and  na  =  1.5,  respectively,  we  can 
determine  the  quartiles  as  follows: 

+3.25  =  +3)  =  —200, 

+3.75  =  +8)  “  0- 

The  interquartile  range  is  cIq  =  0  —  (—200)  =  200. 

(c)  It  is  not  possible  to  use  the  coefficient  of  variation  because  some  of  the  values 
are  negative. 


Solution  to  Exercise  3.3  We  calculate 

nm  =  n  —  nw  —  nc  =  100  —  45  —  20  =  35. 
Using  the  formula  for  the  arithmetic  mean  for  grouped  data, 


1 

x  =  —  (nwxw  +  nmxm  +  HqXq)  , 
n 


we  further  get 


1 

+71  —  (nx  nwxw  ncxc) 

Mm 

1 

=  —(100  •  15  -  45  •  16  -  20  •  7.5)  =  18. 

Similarly,  we  can  use  the  Theorem  of  Variance  Decomposition,  Eq.  (3.27),  to  calcu¬ 
late  the  variance  for  the  grouped  data: 

2  2,2  1  /  2,  2,  2\ 


1 


H —  (nw(xw  -  x)2  +  nm (xm  -  x)2  +  nc(xc  -  x)2) 
n 


This  yields 
1 


m 


=  —  \ns2  — 

nm 


nwsw  ~  ncsc 


-  nw(xw  -  x)2  -  nm (xm  -  x)2  -  nc{xc  *)2] 


-\2 


-\2' 


1 


=  —(100  •  19.55  -  45  •  6  -  20  •  3  -  45  •  l2  -  35  •  32  -  20  •  7.52)  =  4. 
35 
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Solution  to  Exercise  3.4 


(a)  We  evaluate  a  period  of  6  years  which  means  that  T  =  0,  1,2,  3,  4,  5.  To  deter¬ 
mine  the  average  growth  rate,  we  need  to  first  calculate  the  geometric  mean. 
There  are  two  options  to  facilitate  this: 


(i)  Solution: 

Bt/Bt-i  =  (-,  1.04,  1.125,  0.925,  1.2,  0.933) 

xq  =  (1.04  •  1.125  •  0.925  •  1.2  •  0.933)1/5  =  1.04. 


(ii)  Easier  solution: 

xG  =  (28/23) 1/5  =  1.04. 


Since  xq  =  1.04,  the  average  growth  rate  is  r  =  1.04  —  1  =  4  %. 

(b)  To  predict  the  number  of  members  in  2018,  we  could  apply  the  average  growth 
rate  to  the  number  of  members  in  2016  for  two  consecutive  years: 

-  —  _2 
#2018  =  *G#2017,  #2017  =  *G# 2016,  =>  #2018  =  %#2016 

£2018  =  1.042 .28  =  30.28  «  31. 

(c)  We  could  use  the  approach  outlined  in  (b)  to  predict  the  number  of  members 
in  2025.  However,  this  assumes  that  the  average  growth  rate  between  2011  and 
2016  remains  valid  until  2025.  This  is  rather  unrealistic.  The  number  of  members 
of  the  club  increases  in  some  years,  but  decreases  in  other  years.  There  is  no 
guarantee  that  the  pattern  observed  until  2016  can  be  generalized  to  the  future. 
This  highlights  that  statistical  methodology  should  in  general  be  used  with  care 
when  making  long-term  future  predictions. 

(d)  The  invested  money  is  JT  Xi  =€  250  million.  We  deal  with  partially  grouped 
data  because  the  club’s  members  invest  in  groups,  but  the  invested  sum  is  clearly 
defined  and  not  part  of  a  group  or  interval.  We  can  thus  summarize  the  data  as 
follows: 


(0 

1 

2 

3 

4 

Number  of  members 

10 

8 

8 

4 

Rel.  number  of  members  /, 

10/30 

8/30 

8/30 

4/30 

Si 

II 

M 

10/30 

18/30 

26/30 

1 

Money  invested  Xi 

40 

60 

70 

80 

Rel.  amount  per  group 

40/250 

60/250 

70/250 

80/250 

Vi 

40/250 

100/250 

170/250 

1 

The  Lorenz  curve  is  plotted  in  Fig.  B.8. 
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Fig.B.8  Lorenz  curve 

(e)  The  Gini  coefficient  can  be  calculated  using  formula  (3.37)  as  follows: 

G  =  1  -  -^(10(0  +  4/25)  +  8(4/25  +  2/5)  +  8(2/5  +  17/25) 
+  4(17/25  +  1))  =  1  -  268/375  =  107/375  =  0.2853. 

G+  (3=9)  30/29  •  107/375  =  214/725  =  0.2952. 

The  concentration  of  investment  is  thus  rather  weak. 


Solution  to  Exercise  3.5  Let  us  look  at  Fig.  3.8b  first.  The  quantiles  coincide  approx¬ 
imately  with  the  bisection  line.  This  means  that  the  distribution  of  “length  of  service” 
is  similar  for  men  and  women.  They  have  worked  for  about  the  same  amount  of  time 
for  the  company.  For  example,  the  median  service  time  should  be  approximately 
the  same  for  men  and  women,  the  first  quartile  should  be  very  similar  too,  the  third 
quartile  should  be  similar  too,  and  so  on.  However,  Fig.  3.8a  shows  a  somewhat  dif¬ 
ferent  pattern:  the  quantiles  for  men  are  consistently  higher  than  those  for  women. 
For  example,  the  median  salary  will  be  higher  for  men.  In  conclusion,  we  see  that 
men  and  women  have  a  similar  length  of  service  in  the  company,  but  earn  less. 


Solution  to  Exercise  3.6  There  are  many  ways  in  which  a  “mode”  function  can  be 
programmed.  We  present  a  simple  and  understandable  solution,  not  the  most  efficient 
one.  Recall  that  table  constructs  a  frequency  table.  This  table  shows  immediately 
which  value(s)  occur(s)  most  often  in  the  data.  How  can  we  extract  it?  Applying  the 


Appendix  B:  Solutions  to  Exercises 


341 


Fig.B.9  Lorenz  curves 


names  function  on  the  table  returns  the  names  of  the  values  represented  in  the  table. 
The  only  thing  we  need  to  do  is  to  choose  the  name  which  corresponds  to  the  value 
which  has  been  counted  most.  This  yields  the  following  function: 


mymode  <-  function(vec) { 
mt  <-  table (vec) 
names (mt) [mt  ==  max(mt)] 
} 


This  function  will  work  in  general,  though  it  returns  a  character  vector.  Using 
as  .numeric  is  one  option  to  make  the  character  string  numeric,  if  necessary. 


Solution  to  Exercise  3. 7 


(a)  In  this  exercise,  we  do  not  have  individual  data;  i.e.  we  do  not  know  how  much 
each  inhabitant  earns.  The  summarized  data  simply  tells  us  about  the  wealth 
of  two  groups.  For  simplicity  and  for  illustrative  purposes,  we  assume  that  the 
wealth  is  equally  distributed  in  each  group.  We  determine  ( ii[ ,  ty)  as  (0.8,  0.1) 
and  (1,1)  because  80  %  of  the  population  earn  10  %  of  the  wealth  and  100  % 
of  the  population  earn  everything.  The  respective  Lorenz  curve  is  illustrated  in 
Fig.  B.9a. 

(b)  The  upper  class  lost  its  wealth.  This  means  that  20  %  of  the  population  do  not 
own  anything  at  all.  However,  the  remaining  80  %  owns  the  rest.  This  yields 
(Hi,  Vi)  of  (0.2,  0)  and  (1,  1),  see  also  Fig.  B.9b. 

(c)  In  this  scenario,  20  %  of  the  population  leave  the  country.  However,  the  remaining 
80  % — which  are  now  100  %  of  the  population — earn  the  rest.  The  money  is 
equally  distributed  between  the  population.  Figure  B.9c  shows  this  situation. 
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Solution  to  Exercise  3.8 

(a)  It  is  necessary  to  use  the  harmonic  mean  to  calculate  the  average  speed.  Using 
w\  =  n\/n  =  180/418  ~  0.43,  W2  =  117/418  ~  0.28,  and  =  121/418  ~ 
0.29  we  get 

1 

xm  =  — j - 

V1*  Wi 

2^i  =  l  Xi 

1 

=  -  ^  45.2  km/h. 

0.43/48  +  0.28/37  +  0.29/52  ' 

(b)  Travelling  at  45.2  km/h  means  travelling  about  361  km  in  8  h.  The  bus  will  not 
be  in  time. 


Solution  to  Exercise  3.9 


(a)  The  sum  of  investment  is  X/  xi  =  €18,  020.  To  calculate  and  draw  the  Lorenz 
curve,  we  need  the  following  table: 


(0 

1 

2 

3 

4 

investment  v; 

800 

2220 

4700 

10300 

fj 

1/4 

1/4 

1/4 

1/4 

Ui 

1/4 

2/4 

3/4 

1 

relative  investment 

0.044 

0.123 

0.261 

0.572 

Vi 

0.044 

0.168 

0.428 

1 

The  curve  is  presented  in  Fig.  B.10. 

(b)  We  can  calculate  the  Gini  coefficient  as  follows: 


G 


G+ 


(3.37) 


(3.39) 


1 

1  -  -[(0  +  0.044)  +  (0.044  +  0.168)  +  (0.168  +  0.428) 

1 

+  (0.428  +  1)]  =  1  —  -  •  2.28  =  0.43. 
n  4 

- G=  -  -0.43  =  0.57. 

n  -  1  3 


(c)  The  Gini  coefficient  remains  the  same  as  the  relative  investment  stays  the  same. 

(d)  Using  the  library  ineq  we  can  easily  reproduce  the  results  in  R: 


library (ineq) 

investment  <-  c (800 , 10300 ,4700 , 2200) 
plot (Lc (investment) ) 
ineq ( inve  stment ) 
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Fig.B.10  Lorenz  curve  for  investment  sum 

However,  please  note  that  the  ineq  command  calculates  the  unstandardized 
Gini  coefficient. 


Solution  to  Exercise  3.10 


(a)  The  easiest  way  to  get  all  these  measures  is  to  use  the  summary  function  and 
apply  it  to  the  data  columns  which  relate  to  quantitative  variables: 

set wd ( ' C : /yourpath ' ) 

pizza  <-  read . csv ( 1 pizza_de livery . csv 1 ) 
attach(pizza) 

summary (pizza [ , c ( 'time ' , ' temperature ' , 1  bill ' , ' pizzas ' ) ] ) 


We  then  get  the  following  output: 


time 


Min. 

12.27 

1st  Qu. 

30.06 

Median 

34.38 

Mean 

34.23 

3rd  Qu. 

38.58 

Max . 

53.10 

temperature 


Min. 

41.76 

1st  Qu. 

58.24 

Median 

62.93 

Mean 

62.86 

3rd  Qu. 

67.23 

Max . 

87.58 

bill 


Min. 

9.10 

1st  Qu. 

35.50 

Median 

42.90 

Mean 

42.76 

3rd  Qu. 

50.50 

Max . 

75.00 

pizzas 


Min. 

1.000 

1st  Qu. 

2.000 

Median 

3.000 

Mean 

3.013 

3rd  Qu. 

4.000 

Max . 

11.000 
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(b)  We  can  use  the  quantile  function: 

quantile (time ,probs=0 . 99) 
quantile (temperature ,probs=0 . 99) 


The  results  are  48.62  min  for  delivery  time  and  79.87  °C  for  temperature.  This 
means  99  %  of  the  delivery  times  are  less  than  or  equal  to  48.62  min  and  1  %  of 
deliveries  are  greater  than  or  equal  to  48.62  min.  Similarly,  only  1  %  of  pizzas 
were  delivered  with  a  temperature  greater  than  79.87  °C. 

(c)  The  following  simple  function  calculates  the  absolute  mean  deviation: 


amdev  <-  function (mv) {1/length (mv) *sum(abs (mv-mean(mv) )) } 
amdev (temperature) 


(d)  We  can  use  the  scale,  mean,  and  var  commands,  respectively. 

sc. time  <-  scale (time) 
mean(sc .time) 
var (sc .time) 


As  one  would  expect,  the  mean  is  zero  and  the  variance  is  1  for  the  scaled 
variable. 

(e)  The  boxplot  command  draws  a  box  plot;  the  range  option  specifies  the  range 
for  which  extreme  values  are  defined.  As  specified  in  the  help  files,  range =0 
produces  a  box  plot  with  no  extreme  values. 


boxplot (temperature , range=0) 
boxplot (time , range=0) 


The  box  plots  are  displayed  in  Fig.  B.l  1. 

(f)  We  use  the  cut  command  to  create  a  variable  which  has  the  categories  (10,  20], 
(20,  30],  (30,  40],  (40,  50],  (50,  60],  respectively.  Using  the  interval  mid-points, 
as  well  as  the  relative  frequencies  in  each  class  (obtained  via  the  table  com¬ 
mand),  we  get: 


tc  <-  cut (time ,breaks=seq(10 , 60 , 10) ) 

weighted. mean (c (15 ,25,35,45,55) , table (tc) / sum (table (tc) ) ) 
[1]  34.18641 
mean (time) 

[1]  34.22955 


The  weighted  mean  is  very  similar  to  the  mean  from  the  raw  data,  see  output 
above. 
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Fig.  B.11  Box  plots 


(g)  The  plots  can  be  reproduced  by  using  the  qqplot  command: 

qqplot (time  [driver== ' Luigi ' ] , time [driver== 1  Domenico ' ] ) 
qqplot (time [driver== ' Mario ' ] , time [driver== 1  Salvatore ' ] ) 


Chapter  4 

Solution  to  Exercise  4. 1 

(a)  We  need  the  following  table  to  calculate  the  correlation  coefficient  R : 


Cafe  (i) 

Xi 

R(yi) 

di 

4 

1 

3 

1 

6 

2 

-1 

1 

2 

8 

4 

7 

3 

1 

1 

3 

7 

3 

10 

5 

-2 

4 

4 

9 

5 

8 

4 

1 

1 

5 

5 

2 

4 

1 

1 

1 

6 y,F-_ i  6(1  +  1+  4+1  + 1) 

n(n 2  -  1)  “  5(25  -  1) 


=  1  -0.4  =  0.6. 


There  is  a  moderate-to-strong  positive  correlation  between  the  ratings  of  the  two 
coffee  enthusiasts.  In  general,  a  high  rating  from  one  staff  member  implies  a 
rather  high  rating  from  the  other  staff  member. 

(b)  Above  we  have  assigned  ranks  in  an  increasing  order;  i.e.  the  lowest  x*  /y*  gets 
the  lowest  rank  (1)  and  the  highest  Xi/yt  gets  the  highest  rank  (5).  If  we  use 
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decreasing  order  and  assign  the  lowest  rank  to  the  highest  values,  we  get  the 
following  results: 


Cafe  (i) 

Xj 

R(xj) 

yt 

R(yt) 

dj 

df 

1 

3 

5 

6 

4 

1 

1 

2 

8 

2 

7 

3 

-1 

1 

3 

7 

3 

10 

1 

2 

4 

4 

9 

1 

8 

2 

-1 

1 

5 

5 

4 

4 

5 

-1 

1 

As  in  (a),  we  have  df  =  8  and  therefore,  the  results  are  identical:  R  =  0.6. 
Depending  on  whether  ranks  are  assigned  in  an  increasing  order  or  a  decreasing 
order,  the  sign  of  d;  differs,  but  the  calculation  of  R  is  not  affected  since  the 
squared  values  of  d;  are  used  for  its  calculation  and  the  sign  of  d;  is  thus  not 
important. 

(c)  We  can  summarize  the  ratings  in  a  2  x  2  table: 


X  Y 

Coffee  Bad 
Quality  Good 

2  1 

3  4 

The  odds  ratio  is  OR  =  (2  x  4)/ (3  x  1)  =  2.  The  chance  of  rating  a  coffee  as 
good  is  twice  as  likely  for  person  X  compared  to  person  Y. 


Solution  to  Exercise  4.2 


(a)  The  expected  frequencies  (under  independence)  are: 


Satisfied  Unsatisfied 

Car 

Car  (diesel  engine) 

Motorbike 

^  -  28-61  71508'  -  29-39 

14m  =  29-6  71650°  =  30.4 
7«2  =  15.79  7?,3n2  =  16.21 
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We  therefore  have 


rii+n+j 

n 


) 


2 


ni+n+j 


i  =  1  7  =  1  n 

_  (33  —  28. 61)2  (  (25  —  29. 39)2  (  (29  -  29.6)2 
“  28.61  29.39  +  297 

(31  -  30.4)2  (12  -  15.79)2  (20  -  16.21)2 

+  -  - 

30.4  15.79  16.21 

=  0.6736  +  0.6557  +  0.0122  +  0.0112  +  0.9097  +  0.8861  =  3.1485. 


The  maximum  value  x2  can  take  is  150(2  —  1)  =  150  which  indicates  that  there 
is  almost  no  association  between  type  of  vehicle  and  satisfaction  with  the  insur¬ 
ance.  The  other  measures  can  be  calculated  as  follows: 


C 


corr  • 


v  = 


x^ 


n(min(k,  l)  —  1) 


3.1485 
150(2  -  1) 


=  0.14. 


6xorr  — 


X^ 


min(k,  /) 
min(k,  /)  —  1  ]j  x2  +  n 


3.1485  r  , - 

„  - =  V2V0.02056  %  0.20. 

1  V  3.1485  +  150 

The  small  values  of  V  and  Ccor r  confirm  that  the  association  is  rather  weak, 
(b)  The  summarized  table  looks  as  follows: 


Satisfied  Unsatisfied 

Car 

Motorbike 

62  56 

12  20 

Using  (4.7),  we  obtain 


x2  = 


n(ad  —  bc)‘ 


( ci  H-  d)(c  H-  d)(a  H-  c)(b  H-  d ) 
150(1240  -  672)2  48,393,600 


2.2788. 


118 -32-74.76  21,236,224 

The  maximum  value  x2  can  take  is  150(2  —  1).  The  association  is  therefore 
weak.  The  odds  ratio  is 

1240 


ad  62  •  20 

OR  =  —  =  - 

be  12  -  56 


672 


1.845. 


The  chances  of  being  satisfied  with  the  insurance  are  1.845  times  higher  among 
those  who  drive  a  car. 
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Fig.  B.12  Scatter  diagram 
for  speed  limit  and  number 
of  deaths 


(c)  All  x2 -based  statistics  suggest  that  there  is  only  a  small  association  between  the 
two  variables,  for  both  the  2x3  and  the  2x2  tables.  However,  the  odds  ratio 
gives  us  a  more  nuanced  interpretation,  showing  that  customers  driving  a  car 
are  somewhat  more  satisfied  with  their  insurance.  The  question  is  whether  the 
additional  information  from  the  odds  ratio  is  stable  and  trustworthy.  Confidence 
intervals  for  the  odds  ratio  can  provide  guidance  under  such  circumstances,  see 
Sect.  9.4.4  for  more  details. 


Solution  to  Exercise  4.3 


(a)  The  scatter  plot  is  given  in  Fig.  B.12.  The  black  circles  show  the  five  observations. 
A  positive  relationship  can  be  discovered:  the  higher  the  speed  limit,  the  higher 
the  number  of  deaths.  However,  “Italy”  (the  observation  on  the  top  right)  is  the 
observation  which  gives  the  graph  a  visible  pattern  and  drives  our  impression 
about  the  potential  relationship. 

(b)  Using  v  =  61  and  y  =4.86  we  obtain 

Sxx  =  (55  -  61)2  +  (55  -  61)2  +  •  •  •  +  (75  -  61)2  =  270 

Syy  =  (4.1  -  4.86)2  +  (4.7  -  4.86)2  +  •  •  •  +  (6.1  -  4.86)2  =  2.512 

Sxy  =  (55  -  61) (4. 1  -  4.86)  +  •  •  •  +  (75  -  61)(6.1  -  4.86)  =  23.2 
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and  therefore 


Sxy  23.2 

jSxxSyy  V270  •  2.512 


0.891. 


The  correlation  coefficient  of  Spearman  is  calculated  using  the  following  table: 


Country  (/) 

Xi 

R(xi) 

yt 

R(yt) 

d[  df 

Denmark 

55 

4.5 

4.1 

5 

-0.5  0.25 

Japan 

55 

4.5 

4.7 

3 

1.5  2.25 

Canada 

60 

2.5 

4.3 

4 

-1.5  2.25 

Netherlands 

60 

2.5 

5.1 

2 

0.5  0.25 

Italy 

75 

1 

6.1 

1 

o 

o 

This  yields 


6  •  5 

1 - 

5(25  -  1) 


0.75. 


Please  note  that  above  we  averaged  the  ranks  for  ties.  The  R  function  cor  uses  a 
more  complicated  approach;  this  is  why  the  results  differ  slightly  when  using  R. 


(c)  The  results  stay  the  same.  Pearson’s  correlation  coefficient  is  invariant  with 
respect  to  linear  transformations  which  means  that  it  does  not  matter  whether 
we  use  miles/h  or  km/h. 


(d)  (i)  The  grey  square  in  Fig.  B.12  represents  the  additional  observation.  The 
overall  pattern  of  the  scatter  plot  changes  with  this  new  observation  pair: 
the  positive  relationship  between  speed  limit  and  number  of  traffic  deaths 
is  not  so  clear  anymore.  This  emphasizes  how  individual  observations  may 
affect  our  impression  of  a  scatter  plot. 

(ii)  Based  on  the  same  approach  as  in  (b),  we  can  calculate  v  =  62.5,  y  = 
4.6333,  Sxx  =  337.5,  Syy  =  4.0533,  Sxy  =  13,  and  r  =  0.3515.  A  single 
observation  changes  our  conclusions  from  a  strong  positive  relationship 
to  a  moderate-to-weak  relationship.  It  is  evident  that  Pearson’s  correlation 
coefficient  is  volatile  and  may  be  affected  heavily  by  outliers  or  extreme 
observations. 


Solution  to  Exercise  4.4 


(a)  The  contingency  table  for  the  absolute  frequencies  is  as  follows: 


1.  Class  2.  Class  3.  Class  Staff 

Total 

Rescued 

Not  rescued 

202  125  180  211 

135  160  541  674 

718 

1510 

Total 

337  285  721  885 

2228 
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(b)  To  obtain  the  conditional  relative  frequency  distributions,  we  calculate  the  pro¬ 
portion  of  passengers  rescued  (X)  for  each  travel  class  (T).  In  the  notation  of 
Definition  4.1.1,  we  determine  /■  =  fij/f+j  =  riij/n+j  for  all  i  and  j.  For 

example,  /reScued|i.  class  =  202/337  =  0.5994.  This  yields 


1.  Class 

2.  Class 

3.  Class 

Staff 

Rescued 

0.5994 

0.4386 

0.2497 

0.2384 

Not  rescued 

0.4006 

0.5614 

0.7503 

0.7616 

It  is  evident  that  the  proportion  of  passengers  being  rescued  differs  by  travel 
class.  It  can  be  speculated  that  there  is  an  association  between  the  two  variables 
pointing  towards  better  chances  of  being  rescued  among  passengers  from  higher 
travel  classes. 

(c)  Using  (4.3),  we  get 


1.  Class 

2.  Class 

3.  Class 

Staff 

Total 

Rescued 

108.6 

91.8 

232.4 

285.2 

718 

Not  rescued 

228.4 

193.2 

488.6 

599.8 

1510 

Total 

337 

285 

721 

885 

2228 

which  can  be  used  to  calculate  x2  and  V  as  follows: 


x2  = 


(ntj  - 

rii+n+j 


(202  -  108.6)2  (125  -  91.8)2 


i  =  1  7  =  1 


108.6 


91.8 


n 


(180  -  232. 4)2  (211  -  285. 2)2  (135  -  228.4)' 

+ - ~  t - + - - +  — 


232.4 


285.2 


228.4 


(160  -  193.2)2  (541  -  488. 6)2  (674  -  599.8)2 

+  193.2  +  488.6  +  599.8 

=  80.33  +  12.01  +  11.82  +  19.30  +  38.19  +  5.71  +  5.62  +  9.18 

=  182.16. 


V  = 


r 


182.16 


n(mm(k,l)  -  1)  V  2228(2-  1) 


=  0.286. 


The  value  of  V  indicates  a  moderate  or  weak  relationship  between  the  two 
variables.  This  is  in  contradiction  to  the  hypothesis  derived  from  the  conditional 
distributions  in  (b). 


(d)  The  table  is  as  follows: 


1.  Class/2.  Class  3.  Class/Staff  Total 

Rescued 

Not  rescued 

327  391  718 

295  1215  1510 

Total 

622  1606  2228 
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Using  (4.7)  we  get 


9  2228(327  •  1215  -  295  •  391)' 

X  =  - 


718-  1510-622-  1606 


=  163.55. 


and  therefore  V  =  =0.271. 

There  are  several  relative  risks  that  can  be  calculated,  for  example: 


/ill 

/l|2 

/2|1 

/2|2 


n\\/n+i 

nn/n+2 

mi/n+i 

ri22/n+2 


327/622 
391/1606 
295/622 
1215/1606 


2.16, 


0.63. 


The  proportion  of  passengers  who  were  rescued  was  2.16  times  higher  in  the  1 ./ 2. 
class  compared  to  the  3.  class  and  staff.  Similarly,  the  proportion  of  passengers 
who  were  not  rescued  was  0.62  times  lower  in  the  1./2.  class  compared  to  the  3. 
class  and  staff.  The  odds  ratio  is  OR  =  =  3.444.  This  is  nothing 

but  the  ratio  of  the  relative  risks,  i.e.  2.16/0.63.  The  chance  of  being  rescued 
(i.e.  the  ratio  rescued/not  rescued)  was  almost  3.5  times  higher  for  the  1./2.  class 
compared  to  the  3.  class  and  staff. 


(e)  While  Cramer’s  V  led  to  rather  conservative  conclusions  regarding  a  possible 
relationship  of  travel  class  and  rescue  status,  the  conditional  relative  frequency 
distribution,  the  relative  risks,  and  the  odds  ratio  support  the  hypothesis  that,  at 
least  to  some  degree,  the  chances  of  being  rescued  were  higher  in  better  travel 
classes.  This  makes  sense  because  better  travel  classes  were  located  towards  the 
top  of  the  ship  with  best  access  to  the  lifeboats  while  both  third-class  passengers 
and  large  numbers  of  the  staff  were  located  and  working  in  the  bottom  of  the 
ship,  where  the  water  came  in  first. 


Solution  to  Exercise  4.5 


(a)  Using  (4.17)  and  (4.18),  we  get 


Z;*=i  xm  ~  36iv  22776  -  36  •  12.22  -51.28 


r  = 


>xy 


■\J $xx  Syy 


ns^nSy 


nJ*Py 


216.9 


36V76.95  -  706.98 


0.026. 


This  indicates  that  there  is  no  linear  relationship  between  temperature  and  hotel 
occupancy. 
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(b)  The  scatter  plot  shows  no  clear  pattern.  This  explains  why  the  correlation  coeffi¬ 
cient  is  close  to  0.  However,  if  we  look  only  at  the  points  for  each  city  separately, 
we  see  different  structures  for  different  cities:  a  possible  negative  relationship 
for  Davos  (D),  a  rather  positive  relationship  for  Polenca  (P)  and  no  visible  rela¬ 
tionship  for  Basel  (B).  This  makes  sense  because  for  winter  holiday  destinations 
hotel  occupancy  should  be  higher  when  the  temperature  is  low  and  for  summer 
holiday  destinations  occupancy  should  be  high  in  the  summer  months. 

(c)  We  type  in  the  data  by  specifying  three  variables:  temperature  (X),  occu¬ 
pancy  (Y)  and  city  (Z).  We  then  simply  use  the  cor  command  to  calculate 
the  correlation — and  condition  on  the  values  of  Z  which  we  are  interested  in: 

X  <-  c(-6,-5,2,  .  .  .  ,9,4)  [■£ 

Y  <-  c(91,89,76, . . . ,9,12) 

Z  <-  c(rep( 'Davos ', 12) ,rep( 'Polenca' , 12) ,rep( 'Basel ', 12) ) 
cor (X [Z== ' Davos ' ] , Y [Z== ' Davos ' ] ) 
cor (X [Z== ' Basel ' ] , Y [Z== ' Basel ' ] ) 
cor (X  [Z== ' Polenca ' ] , Y [Z== ' Polenca ' ] ) 


This  yields  correlation  coefficients  of  —0.87  for  Davos,  0.42  for  Basel  and  0.82 
for  Polenca.  It  is  obvious  that  looking  at  X  and  Y  only  indicates  no  correlation, 
but  the  information  from  Z  shows  strong  linear  relationships  in  subgroups.  This 
example  shows  the  limitations  of  using  correlation  coefficients  and  the  necessity 
to  think  in  a  multivariate  way.  Regression  models  offer  solutions.  We  refer  the 
reader  to  Chap.  1 1,  in  particular  Sect.  1 1.7.3  for  more  details. 


Solution  to  Exercise  4.6 


(a)  We  use  the  visual  rule  of  thumb  and  work  from  the  top  left  to  the  bottom  right  for 
the  concordant  pairs  and  from  the  top  right  to  the  bottom  left  for  the  discordant 
pairs: 


K  =  82  •  43  +  82  •  9  +  82  •  2  +  82  •  10  +  8  •  2  +  8  •  10  +  4  •  10  +  4  •  9 
+43  •  10  =  5850 


£>  =  4-  8  +  9-  2  =  50 
_  K  -  D  _  5800 

7  “  K  +  D  ~  5900 


0.98. 


k 


X 


»!'  +  »+  j\  2 

n  ) 


ni+n+j 


86-90 \2 
158  ^  , 

86-90  “h 
158 


49-86x2 
158  1 
49-86 
158 


i  =  1  7=1 


n 
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+ 

+ 


(0 

(0 


19-86x2 
158  > 
19-86 
158 

12-90  x  2 
158  > 
12-90 
158 


+ 

+ 


sq  90-60^2 
(°  158  l 

90-60 
158 

n  12-49  \2 

158  >_ 
12-49 
158 


+ 

+ 


(43 

(10 


49-60  x  2 
158  1 
49-60 
158 

12-19x2 
158  } 
12-19 
158 


+ 


60-19x2 
158  1 
60-19 
158 


=  22.25.19.27  +  10.34  +  20.05  +  31.98  +  0.47  +  6.84  +  0.80  +  50.74 
=  162.74. 


V  = 


X^ 


n(min(k,  l)  —  1) 


162.74 

158-2 


0.72. 


(c)  The  table  is  as  follows: 


Use  a  leash 

Agree  or  no. 

Disagree 

Total 

Use  for  concerts  Agree  or  no. 

137 

9 

146 

Disagree 

2 

10 

12 

Total 

139 

19 

158 

(d)  The  relative  risk  can  either  be  summarized  as: 


2/139 

10/19 


~  0.03 


or 


10/19 

2/139 


36.6. 


The  proportion  of  those  who  disagree  with  using  the  park  for  summer  concerts  is 
0.03  times  lower  in  the  group  who  agree  or  have  no  opinion  about  using  leashes 
for  dogs  compared  to  those  who  disagree.  Similarly,  the  proportion  of  those  who 
disagree  with  using  the  park  for  summer  concerts  is  36.6  times  higher  in  the 
group  who  also  disagree  with  using  leashes  for  dogs  compared  to  those  who  do 
not  disagree. 


(e)  The  odds  ratio  is  OR  =  (137  •  10)/ (2  *  9)  ~  36. 1 . 


•  The  chance  of  not  disagreeing  with  the  concert  proposal  is  36.1  times  higher 
for  those  who  also  do  not  disagree  with  the  leash  proposal. 

•  The  chance  of  not  disagreeing  with  the  leash  proposal  is  36. 1  times  higher  for 
those  who  also  do  not  disagree  with  the  concert  proposal. 

•  In  simpler  words:  The  chance  of  agreeing  or  having  no  opinion  for  one  of  the 
questions  is  36.1  times  higher  if  the  person  also  has  no  opinion  or  agrees  with 
the  other  question. 


137-10-9-2 

7  = 


1352 


0.974. 


137-10  +  9-2  1388 
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(g)  In  general,  it  makes  sense  to  use  all  the  information  available,  i.e.  to  use  the 
ordinal  structure  of  the  data  and  all  three  categories.  While  it  is  clear  that  7  is 
superior  to  V  in  our  example,  one  may  argue  that  the  relative  risks  or  the  odds 
ratio  could  be  more  useful  because  they  provide  an  intuitive  quantification  on 
how  the  two  variables  relate  to  each  other  rather  than  just  giving  a  summary 
of  strength  and  direction  of  association.  However,  as  we  have  seen  earlier,  the 
interpretations  of  the  relative  risks  and  the  odds  ratio  are  quite  clumsy  in  this 
example.  It  can  be  difficult  to  follow  the  somewhat  complicated  interpretation. 
A  simple  summary  would  be  to  say  that  agreement  with  both  questions  was 
strongly  associated  (7  =  0.98). 


Solution  to  Exercise  4. 7  We  use  the  definition  of  the  correlation  coefficient,  replace 
yi  with  a  +  bxi  and  replace  y  with  a  +  bx  to  obtain 

Yti- 1  (•*/  ~  x)(a  +  bxi  -  (a  +  bx)) 
r  =  —  — '~L  '  - . 

-  *)2  Z!"=i(a  +  bxi  -  (a  +  M))2 

This  equates  to: 

r  _  ZLlh/  -x){b{xj  -x))  _  _ bY!j=\(xi  ~*)2 _  _  1 

yz?=i(*.--*)2su  (b(xi-x))2  72Z?=i(*;-i)22:?=i(*;-*)2 

Solution  to  Exercise  4.8 


(a)  We  read  in  the  data,  make  sure  the  first  column  is  recognized  as  containing  the 
row  names,  attach  the  data,  and  obtain  the  correlation  using  the  cor  command: 

decathlon  <-  read. csv( 1  decathlon. csv ' ,  row.names=l) 

attach (decathlon) 

cor (X . Discus , X . High .jump) 

The  correlation  is  0.52984.  This  means  there  is  a  moderate-to- strong  positive 
correlation;  i.e.  the  longer  the  distance  the  discus  is  thrown,  the  higher  the  height 
in  the  high  jump  competition. 

(b)  There  are  10  variables.  For  the  first  variable,  we  can  calculate  the  correlation  with 
9  other  variables.  For  the  second  variable,  we  can  also  calculate  the  correlation 
with  9  other  variables.  However,  we  have  already  calculated  one  out  of  the 
9  correlations,  i.e.  when  analysing  variable  number  one.  So  it  is  sufficient  to 
calculate  8  correlations  for  the  second  variable.  Similarly,  we  need  another  7 
correlations  for  the  third  variable,  6  correlations  for  the  fourth  variable,  and  so  on. 
In  total,  we  therefore  need  to  have  9  +  8  +  7  +  --  -  +  l=45  correlations.  Since 
the  correlation  coefficient  describes  the  relationship  between  two  variables,  it 
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Fig.  B.13  Correlation  matrix 
for  the  decathlon  data 
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X.IOOm 

1 

-0.7 

-0.37 

-0.31 

0.63 

0.54 

-0.23 

-0.26 

-0.01 

0.06 

X.Long.jump 

-0.7 

1 

0.2 

0.35 

-0.67 

-0.54 

0.25 

0.29 

0.09 

-0.15 

X. Shot. put 

-0.37 

0.2 

1 

0.61 

-0.2 

-0.25 

0.67 

0.02 

0.38 

0.13 

X.High.jump 

-0.31 

0.35 

0.61 

1 

-0.17 

-0.33 

0.52 

-0.04 

0.2 

0 

X.400m 

0.63 

-0.67 

-0.2 

-0.17 

1 

0.52 

-0.14 

-0.12 

-0.05 

0.55 

X.1 10m. hurdle 

0.54 

-0.54 

-0.25 

-0.33 

0.52 

1 

-0.22 

-0.15 

-0.08 

0.18 

X. Discus 

-0.23 

0.25 

0.67 

0.52 

-0.14 

-0.22 

1 

-0.18 

0.25 

0.22 

X.  Pole,  vault 

-0.26 

0.29 

0.02 

-0.04 

-0.12 

-0.15 

-0.18 

1 

-0.07 

0.18 

X.Javeline 

-0.01 

0.09 

0.38 

0.2 

-0.05 

-0.08 

0.25 

-0.07 

1 

-0.25 

X.1 500m 

0.06 

-0.15 

0.13 

0 

0.55 

0.18 

0.22 

0.18 

-0.25 

1 

makes  sense  to  summarize  the  results  in  a  contingency  table,  similar  to  a  matrix, 
see  Fig.  B.13. 

(c)  Using  cor  (decathlon)  yields  the  correlation  coefficients  between  all  variable 
pairs.  This  is  much  simpler  than  calculating  the  correlation  coefficient  for  each  of 
the  45  comparisons.  Note  that  the  correlation  matrix  provides  the  45  comparisons 
both  in  the  upper  triangle  and  in  the  lower  triangle  of  the  table.  We  know  that 
r  (X,  Y)  =  r(Y ,  X ),  but  R  still  provides  us  with  both,  although  they  are  identical. 
Note  that  the  diagonal  elements  are  1  because  r(X,  X)  =  1. 

(d)  One  way  to  omit  rows  with  missing  data  automatically  is  to  use  the  na .  omit 
command: 

cor (na . omit (decathlon) ) 


The  results  are  displayed  in  Fig.  B.13.  We  see  moderate-to- strong  correlations 
between  the  100  m  race,  400  m  race,  110  m  hurdle  race  and  long  jump.  This 
may  reflect  the  speed-and-athletic  component  of  the  decathlon  contest.  We  also 
see  moderate-to-strong  correlations  between  the  shot-put,  high  jump,  and  discus 
events.  This  may  reflect  the  strength-and-technique  component  of  the  contest. 
The  other  disciplines  also  show  correlations  which  make  sense,  but  they  are 
rather  weak  and  may  reflect  the  uniqueness  of  these  disciplines. 
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Solution  to  Exercise  4.9 


(a)  A  possible  code  is  listed  below: 


pizza  <-  read. csv ( 'pizza_delivery. csv' ) 

pizza$tempcat  <-  cut (pizza$temperature ,  breaks=c (0 , 65 , 100) ) 
pizza$timecat  <-  cut (pizza$time ,  breaks=c (0 , 30 , 100) ) 
attach(pizza) 

addmargins (table (tempcat , timecat) ) 


time cat 


tempcat 

(0,30] 

(30,100] 

Sum 

(0,65] 

101 

691 

792 

(65,100] 

213 

261 

474 

Sum 

314 

952 

1266 

We  can  see  that  there  is  a  higher  proportion  of  high  temperature  ((65,  100])  in 
the  category  of  short  delivery  times  ((0,  30])  compared  to  long  delivery  times 
((30,  100]). 

(b)  Using  the  data  from  (a),  we  can  calculate  the  odds  ratio: 

(101*261)/ (213*691) 


Thus,  the  chances  of  receiving  a  cold  pizza  are  0.18  lower  if  the  delivery  time  is 
short. 

(c)  We  use  the  vcd  and  ryouready  packages  to  determine  the  desired  measures  of 
association: 

library (vcd) 
library (ryouready) 
library (lattice) 

assocstats  (xtabs  (~tempcat+timecat) ) 
ord . gamma (table (tempcat , timecat) ) 
ord . tau (table (tempcat , timecat) ) 

barchart (table (tempcat , timecat) ,horizontal=F , stack=T) 
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timecat 

(0,30]  H  (30,100]  I  I 


800  - 


0  600  - 

O 

c 

<D 

c r 


(0,65]  (65,100] 

tempcat 

(a)  Stacked  bar  chart  for  the  catego¬ 
rical  time  and  temperature  variables 


•  • 
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Time 

(b)  Scatter  plot  for  temperature  and 
time 


0 

s— 

=3 

03 

4— 

0 

Q_ 

E 

,0 


o 

oo 


o 


o 

co 


o 

LO 


o 


o 

8 


Bruno  Domenico  Luigi  Mario  Salvatore 

Driver 


(c)  Temperature  by  driver 
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Operator 


(d)  Temperature  by  operator 


Fig.  B.14  Plots  for  Exercise  4.9 


Cramer’s  V  is  0.361,  Stuart’s  rc  is  —0.302,  and  7  is  —0.696.  The  first  two 
measures  indicate  a  moderate-to-weak  relationship  between  temperature  and 
delivery  time.  It  is  clear  from  the  last  two  measures  that  this  relationship  is 
negative,  i.e.  that  shorter  delivery  times  imply  higher  temperatures.  However, 
interestingly,  rc  and  7  provide  us  with  a  different  strengths  of  association.  In 
any  case,  it  is  clear  that  for  shorter  delivery  times  the  customers  receive  warmer 
pizzas,  as  evident  from  the  stacked  bar  chart  (Fig.  B.14a). 

(d)  The  scatter  plot  (Fig.  B.14b)  shows  a  decreasing  temperature  for  an  increasing 
delivery  time.  This  is  also  highlighted  in  the  correlation  coefficients  which  are 
—0.43  and  —0.39  for  Bravais-Pearson  and  Spearman,  respectively. 
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plot (time , temperature) 
cor (time , temperature) 

cor (time , temperature ,method= ' spearman ' ) 


(e)  It  makes  sense  to  compare  continuous  variables  (temperature,  number  of  pizzas, 
bill)  using  the  correlation  coefficient  from  Bravais-Pearson.  The  relationships 
between  temperature  and  driver  and  operator  could  be  visualized  using  stratified 
box  plots. 


boxplot  (temperature~driver) 
boxplot  (temperature~operator) 
cor (temperature , pizzas) 
cor (temperature , bill) 


The  correlation  coefficients  are  —0.37  and  —0.42  for  number  of  pizzas  and 
bill,  respectively.  More  pizzas  and  a  higher  bill  are  associated  with  a  lower 
temperature.  The  box  plots  (Fig.  B.14c,  d)  show  variation  in  the  delivery  times 
of  the  different  drivers,  but  almost  identical  delivery  times  for  the  two  operators. 
These  results  give  us  a  first  idea  about  the  relationship  in  the  data.  However, 
they  do  not  tell  us  the  full  story.  For  example:  is  the  pizza  temperature  for  higher 
bills  low  because  a  higher  bill  means  more  pizzas,  and  therefore,  a  less  efficient 
preparation  of  the  food?  Is  one  driver  faster  because  he  mainly  delivers  for  a 
particular  branch?  Could  it  be  that  the  operators  have  a  different  performance 
but  because  they  deal  with  different  branches  and  drivers,  these  differences  are 
not  visible?  To  address  these  questions,  a  multivariate  perspective  is  needed. 
Regression  models,  as  introduced  in  Chap.  11,  provide  some  answers. 


Chapter  5 


Solution  to  Exercise  5.1  There  are  n  =  10  guests  and  m  =  2  guests  shake  hands 
with  each  other.  The  order  is  not  important:  two  guests  shake  hands  no  matter  who 
is  “drawn  first”.  It  is  also  not  possible  to  shake  hands  with  oneself  which  means  that 
in  terms  of  the  urn  model,  we  have  the  “no  replacement”  case.  Therefore,  we  obtain 
the  solution  as 


10-9 
2  •  1 


handshakes  in  total. 
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Solution  to  Exercise  5.2  We  assume  that  it  is  not  relevant  to  know  which  student 
gets  tested  at  which  time  point.  We  are  thus  dealing  with  combinations  without 
considering  the  order.  We  have  n  =  25  and  m  =  5  and  therefore  obtain: 


/25\ 

(a)  a  total  number  of  ( 5  )  =  53,130  possibilities. 

(b)  a  total  number  of  (25+55_1)  =  (259)  =  118,755  possibilities. 

In  R ,  the  commands  choose  (25,5)  and  choose  (29,5),  respectively  provide  the 
required  results. 

Solution  to  Exercise  5.3  The  board  consists  of  n  =  381  knots.  Each  knot  is  either 
occupied  or  not.  We  may  thus  assume  a  “drawing”  without  replacement  in  the  sense 
that  each  knot  can  be  drawn  (occupied)  only  once.  If  we  place  m  =  64  counters  on 
the  board,  then  we  can  simultaneously  think  of  “drawing”  64  occupied  knots  out  of 
a  total  of  381  knots.  The  order  in  which  we  draw  is  not  relevant — either  a  knot  is 
occupied  or  not.  The  total  number  of  combinations  is  (”)  =  ~  4.35  •  1073.  We 

obtain  the  final  number  in  R  using  the  command  choose  (381,64). 


Solution  to  Exercise  5.4  We  obtain  the  results  (again  using  the  command 
choose (n,m)  in  R)  as  follows: 


(a)  The  customer  takes  the  beers  “with  replacement”  because  the  customer  can 
choose  among  any  type  of  beer  for  each  position  in  the  tray.  One  can  also  think 
of  an  urn  model  with  6  balls  relating  to  the  six  different  beers,  where  beers  are 
drawn  with  replacement  and  placed  on  the  tray.  The  order  in  which  the  beers  are 
placed  on  the  tray  is  not  relevant.  We  thus  have 

e+n-rn-o-”-'* 

combinations. 

(b)  If  the  customer  insists  on  having  at  least  one  beer  per  brewery  on  his  tray,  then 
6  out  of  the  20  positions  of  the  tray  are  already  occupied.  Based  on  the  same 
thoughts  as  in  (a),  we  calculate  the  total  number  of  combinations  as 

C+rH6+:rH::)— 


Solution  to  Exercise  5.5  Since  each  team  has  exactly  one  final  place,  we  have  a 
“without  replacement”  situation.  Using  n  =  32  and  m  =  3  (and  choose  (n  ,m)  in  R) 
yields 
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(a)  =  (332)3!  =  29,  760  and 

(b)  (32)  =  4960. 


Solution  to  Exercise  5.6  There  are  n  =  12  different  letters  for  m  =  4  positions  of 
the  membership  code.  Each  letter  can  be  used  more  than  once  if  desired  and  we 
thus  obtain  nm  =  124  =  20,736  possible  combinations.  We  therefore  conclude  that 
sufficient  membership  codes  are  left.  However,  this  may  not  last  long  and  the  book 
store  may  still  wish  to  create  another  solution  for  its  membership  codes. 


Solution  to  Exercise  5.7  For  each  member  of  the  jury,  there  are  61  scoring  options: 

0  0.1  0.2  0.3  0.4  0.5  0.6  0.7  0.8  0.9 
1  1.1  1.2  1.3  1.4  1.5  1.6  1.7  1.8  1.9 


5  5.1  5.2  5.3  5.4  5.5  5.6  5.7  5.8  5.9 

6 

Different  jury  members  are  allowed  to  assign  the  same  scores.  We  thus  deal  with 
combinations  “with  replacement”.  To  verify  this,  just  think  of  an  urn  with  61  balls 
where  each  ball  refers  to  one  possible  score.  Now  one  ball  is  drawn,  assigned  to  a 
specific  jury  member  and  then  put  back  into  the  urn.  Since  each  score  is  “attached” 
to  a  particular  jury  member,  we  have  combinations  with  consideration  of  the  order 
and  therefore  obtain  a  total  of  nm  =  619  ~  1.17  •  1016  possibilities.  If  you  have 
difficulties  in  understanding  the  role  of  “replacement”  and  “order”  in  this  example, 
recall  that  each  member  has  61  scoring  options:  thus,  61x61x---x61(9  times) 
combinations  are  possible. 


Solution  to  Exercise  5.8 


(a)  We  obtain: 


II 

1— 1 

$ 

II 

7o~to 

II 

1— 1 

0-0=0- 

0-0=0-  ■ 

0 = 10  ~  0  -  0  - 10 

(b)  Based  on  the  observations  from  (a)  we  conclude  that  each  entry  on  the  diagonal 
line  can  be  represented  by  Q) .  The  sum  of  two  consecutive  entries  is  thus  (")  + 
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in  vy  )• 


(,z^1).  Using  the  fourth  relation 

AA  /vi  -f  (5_5)  n(n  —  1)  (n  +  1  )n 


gmt) 


n(n  —  1+^  +  1) 


n 


Chapter  6 


Solution  to  Exercise  6. 1 

(a)  We  obtain 

•  AH  B  =  {8}. 

.  B  DC  =  {0}. 

•  ADC  =  {0,4,  8,  9,  15}. 

•  C\ A  =  { 4,9,  15}. 

•  Q  \  (B  U  A  U  C)  =  {6,  7,  11,  13,  14}. 

(b)  The  solutions  are: 

•  P(F )  =  1  -  P(F )  =  0.5. 

•  P(G)  =  1  -  P(E)  -  P(F)  =  0.3. 

•  P(E  D  G)  =  0  because  the  events  are  pairwise  disjoint. 

•  P(E  \E)  =  0. 

•  P(E  U  F)  =  P(E)  +  P(F)  =  0.7. 


Solution  to  Exercise  6.2  We  know  that  the  probability  of  failing  the  practical  exam¬ 
ination  is  P(PE)  =  0.25,  of  failing  the  theoretical  examination  is  P(T E)  =0.15, 
and  of  failing  both  is  P(PE  n  T E)  =  0.1. 

(a)  If  we  ask  for  the  probability  of  failing  in  at  least  one  examination,  we  imply 
that  either  the  theoretical  examination,  or  the  practical  examination,  or  both 
are  not  passed.  We  can  therefore  calculate  the  union  of  events  P(PE  U  TE)  = 
P(PE)  +  P(TE)  -  P(TE  H  PE)  =  0.25  +  0.15  -  0.1  =  0.3. 

(b)  P(PE\T E)  =  P(PE)  -  P(PE  H  TE)  =  0.25  -  0.1  =  0.15. 

(c)  P(PE  U  TE)  =  1  -  P(PE  U  TE)  =  1  -  0.3  =  0.7. 
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(d)  We  are  interested  in  the  probability  of  the  person  failing  exactly  in  one  exam.  This 
corresponds  to  P(M\C  U  C\M )  =  P(M  U  C)  —  P(C  Hi  M)  =  0.3  —  0.1  = 
0.2. 


Solution  to  Exercise  6.3  The  total  number  of  possible  simple  events  is  |  Q  \  =  12.  The 
number  of  favourable  simple  events  is 


(a)  |  A  |  =  6  (i.e.  the  numbers  2,  4,  6,  8,  10,  12).  Hence,  P(A)  =  = 

(b)  \B\  =  3  (i.e.  the  numbers  10,  11,  12).  Hence,  P{B)  =  ^ 

(c)  \C\  =  2  (i.e.  the  numbers  10,  12).  Hence,  P(AD  B)  =  ^ 

(d)  \D\  =1  (i.e.  the  numbers  2,  4,  6,  8,  10,  11,  12).  Hence,  P(AU  B) 


l 

2- 


_7_ 

12' 


Solution  to  Exercise  6.4  The  total  number  of  simple  events  is 


(a)  The  number  of  favourable  simple  events  is  one  and  therefore  P  (right  two 

presents)  =  j^j  =  ~  0.015. 

(b)  The  number  of  favourable  simple  events  is  (l,0)  because  the  person  draws  two 
presents  out  of  the  ten  “wrong”  presents: 

P  (wrong  two  presents)  =  pj  =  ~  0.682.  InSect.  8.1.8,  we  explain 

the  hypergeometric  distribution  which  will  shed  further  light  on  this  exercise. 


Solution  to  Exercise  6.5 

(a)  Let  V  denote  the  event  that  there  is  too  much  salt  in  the  soup  and  let  L  denote 
the  event  that  the  chef  is  in  love.  We  know  that 

P(V)  =  0.2  =>  P(V)  =  0.8. 

Similarly,  we  have 

P(L)  =  0.3  =>  P(L)  =  0.7. 

We  therefore  get: 

P(V  fl  L)  =  P(V\L)  •  P(L)  =  0.6  •  0.3  =  0.18. 

P(V  n  L)  =  P(L)  -  P(V  HL)  =  0.3  -  0.18  =  0.12. 

P(V  n  L)  =  P(V)  -  P(V  HL)  =  0.2  -  0.18  =  0.02. 

p(v  n  L)  =  p(V)  -  P(v  nL)  =  o.8  -  0.12  =  0.68. 

This  yields  the  following  contingency  table: 
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V  V 

Total 

L 

0.18  0.12 

0.3 

L 

0.02  0.68 

0.7 

Total 

0.2  0.8 

1 

(b)  The  variables  are  not  stochastically  independent  since,  for  example,  P(V)  • 
P(L)  =  0.3  •  0.2  =  0.06  ^  0.18  =  P(V  n  L). 


Solution  to  Exercise  6.6 


(a)  We  define  the  following  events:  G  =  Basil  is  treated  well,  G  =  Basil  is  not 
treated  well;  E  =  Basil  survives,  E  =  Basil  dies.  We  know  that 


P(G)=l-^P(G)  =  ^  P{E\G)=l~-  P(E\G ) 

Using  the  Law  of  Total  Probability,  we  get 

P(E)  =  P(E\G )  •  P(G)  +  P(E\G)  •  P(G) 


3 

4 


_  i  2  3  1  _  1  1  _  7 

-2’3  +  4'3“3+4-12 

(b)  We  can  use  Bayes’  Theorem  to  answer  the  question: 

P(E\G)  •  P(G) 


0.58. 


P(G\E)  = 


P(E\G )  •  P(G)  +  P(E\G)  •  P(G) 


3  1 

4  *  3 

J-  7 
12 


=  -  «  0.43. 


Solution  to  Exercise  6. 7  We  define  the  following  events  and  probabilities 

•  A:  Bill  never  paid,  P(A)  =  0.05  =>►  P(A)  =  0.95. 

•  M:  Bill  paid  too  late,  P(M )  =  ? 

•  P(Af|A)  =  0.2. 

•  P(M\A)  =  1  because  someone  who  never  pays  will  always  pay  too  late. 


(a)  We  are  interested  in  P(M ),  the  probability  that  someone  does  not  pay  his  bill  in 
a  particular  month,  either  because  he  is  not  able  to  or  he  pays  too  late.  We  can 
use  the  Law  of  Total  Probability  to  obtain  the  results: 

P(M)  =  P(M\A)P(A)  +  P(M\A)P(A) 

=  0.05  •  1  +  0.2  •  0.95  =  0.24. 
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(b)  We  can  use  Bayes’  Theorem  to  obtain  the  results: 

P(A)P(M  |  A)  0.05 
P(A  |  M)  =  —— — - —  =  - =  0.208. 

P(M)  0.24 

(c)  If  the  bill  was  not  paid  in  a  particular  month,  the  probability  is  20.8  %  that  it  will 
never  be  paid,  and  78.2  %  that  it  will  be  paid.  One  could  argue  that  a  preventive 
measure  that  affects  almost  79  %  of  trustworthy  customers  are  not  ideal  and  the 
bank  should  therefore  not  block  a  credit  card  if  a  bill  is  not  paid  on  time. 


Solution  to  Exercise  6.8 

(a)  The  “and”  operator  refers  to  the  joint  distribution  of  two  variables.  In  our  exam¬ 
ple,  we  want  to  know  the  probability  of  being  infected  and  having  been  trans¬ 
ported  by  the  truck.  This  probability  can  be  directly  obtained  from  the  respective 
entry  in  the  contingency  table:  40  out  of  200  cows  fulfil  both  criteria  and  thus 

40 

p(B  n  A)  =  — . 

v  s  AAA 


(b)  We  can  use  P( A)  =  ^  = 

P(B\A)  = 


200 

P(A)  to  obtain: 
P(BDA)  _  40/200 
P(A)  ~  100/200 


100 


(c)  Using  these  results  and  P  (B)  =  ^,and  P(B)  =  ^  =  1  —  P  (B),  we  obtain 


P(B  n  A)  20/200  20 

P(B\A)  =  — — = — -  =  — - - =  - 

P(A)  100/200  100 

by  using  the  Law  of  Total  Probability.  We  can  thus  calculate 

P(B)  =  P(B\A)P(A)  +  P(B\A)P(A) 

=  0.40  •  0.50  +  0.20  •  0.50  =  0.30. 

This  means  that  the  probability  of  a  cow  being  infected  is  30  %.  Alternatively, 
we  could  have  simply  looked  at  the  marginal  distribution  of  the  contingency 
table  to  get  P(B)  =  60/200  =  0.3. 

Solution  to  Exercise  6.9 


(a)  The  two  shots  are  independent  of  each  other  and  thus 

P(AHB)  =  0.4  •  0.5  =  0.2. 

P(A  U  B)  =  0.4  +  0.5  -  0.2  =  0.7. 

(b)  We  need  to  calculate 

P(A\B  U  B\A)  =  0.4  -  0.2  +  0.5  -  0.2  =  0.5. 


P(B\A)  =  0.5  -0.2  =  0.3. 
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Chapter  7 


Solution  to  Exercise  7. 1 


(a)  The  first  derivative  of  the  CDF  yields  the  PDF,  F\x)  =  f(x): 

0  if  v  <  2 

/OO  = 


-^x-j-2  if  2  <  v  <  4 


0 


if  v  >  4. 


(b)  We  know  from  Theorem  7.2.3  that  for  any  continuous  variable  P(X  =  xo)  =  0 
and  therefore  P(X  =  4)  =  0.  We  calculate  P(X  <  3)  =  P(X  <3 )  —  P(X  = 
3)  =  F(3)  —  0  =  — |  +  6  —  3  =  0.75. 

(c)  Using  (7.15),  we  obtain  the  expectation  as 

/oo  r  2  p4 

x  fix)  dx  =  x0dx+  x 
-oo  7—oo  J2 

*4 

=  o  + 


(~T  +  2) 


•oo 


— -v  +  2  J  dx  +  J  vO  dx 


J  -x2  +  2x^  dx  +  0 


1 


Given  that  we  have  already  calculated  E(X),  we  can  use  Theorem  7.3.1  to  cal¬ 
culate  the  variance  as  Var(X)  =  E(X2)  —  [E(X)]2.  The  expectation  of  X2  is 


E(XZ)  =  i  x - 


(4-+ x=L\-\ 


—  -x  +  2x  I  dx 


) 


1  4  2  +  . 

(  _  128  \  , 

/  „  16\ 

22 

--V4  +  -x3\  =  I 

32+  „  ) 

2+  1 

-  „ 

8  3  J2 

l  3  / 

t  3  ^ 

3 

We  thus  obtain  Var(X)  —  ^  —  (| )2  — 


66-64 


2 

9 


Solution  to  Exercise  7.2 


(a)  The  probability  mass  function  of  X  is 


Xi 

1  2  3  4  5  6 

P(X  =  x) 

1112  13 

9  9  9  9  9  9 

Using  (7.16),  we  calculate  the  expectation  as 


1112  1 
E(2f)  —  1  *  —  — F  2  *  —  — F  3  *  —  — F  4  *  —  — F  5  *  —  — F  6 

v/  9  9  9  9  9 

1+2  +  3  +  8  +  5  +  18  37 

_  _  _  _  r^y 


3 

9 


9 


9 


4.1. 
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To  obtain  the  variance,  we  need 

9  1  1  1  2  1  3 

E(X2)  =  l-  -+  4-  -+  9*-  +  16-  — |-  25  •  -  +  36  •  - 

\  /  ^  1  ^  1  ^  1  ^  1  ^  1  ^ 

_  1  +  4  +  9  +  32  +  25  +  108  _  179 
““  9  “  ~9~' 


Therefore,  using  Var(X)  =  E(X2)  —  [E(X)]2,  we  get 


Var(X)  = 


179 


1611  -  1369  242 


81 


81 


2.98. 


The  manipulated  die  yields  on  average  higher  values  than  a  fair  die  because  its 
expectation  is  4.1  >  3.5.  The  variability  of  is,  however,  similar  because  2.98 
2.92. 

(b)  The  probability  mass  function  of  Y  =  j  is: 


II 

*  |h- 

1  1  1  1  1  1 

1  2  3  4  5  6 

17 

II 

1112  13 

9  9  9  9  9  9 

The  expectation  can  hence  be  calculated  as 


E(T)  =  E 


1111112  1113 

1  •  —  T  —  ,  —  T  —  ,  —  T  —  ,  —  T  —  ,  —  T  —  •  — 
9  2  9  3  9  4  9  5  9  6  9 

1  ,  1  ,  1  ,  1  ,  1  ,  1  _  91 
9+18  +  27  +  l8+45  +  l8_  270' 


Comparing  the  results  from  (a)  and  (b)  shows  clearly  that  E(  j)  /  E(xJm  ^eca^ 
that  E (bX)  =  bE(X).  It  is  interesting  to  see  that  for  some  transformations  T  (X) 
it  holds  that  E(7" (X))  =  T (E(X)),  but  for  some  it  does  not.  This  reminds  us  to 
be  careful  when  thinking  of  the  implications  of  transformations. 


Solution  to  Exercise  7.3 


(a)  There  are  several  ways  to  plot  the  CDF.  One  possibility  is  to  define  the  function 
and  plot  it  with  the  curve  command.  Since  the  function  has  different  defini¬ 
tions  for  the  intervals  [oo,  0),  [0,  1],  (1,  oo],  we  need  to  take  this  into  account. 
Remember  that  a  logical  statement  in  R  corresponds  to  a  number,  i.e.  TRUE  =  1 
and  FALSE  =  0;  we  can  thus  simply  add  the  different  pieces  of  the  function  and 
multiply  them  with  a  condition  which  specifies  if  X  is  contained  in  the  interval 
or  not  (Fig.  B.15): 
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Fig.  B.15  Cumulative  distribution  function  for  the  proportion  of  wine  sold 


cdf  <  — function (x) { 

(3  *  x  32  —  2  *  jC3)  <=  1  Sex  >=  0)  +  1  *  (v  >  1)  +  0  *  (r  <  0) 

} 

curve (cdf ,f rom=-0 . 5 ,to=l . 5) 


(b)  The  PDF  is 


d 

dx 


F(x)  =  F'(x)  =  f{x)  = 


6(v 

0 


x2)  if  0  <  x  <  1 
elsewhere. 


(d)  We  have  already  defined  the  CDF  in  (a).  We  can  now  simply  plug  in  the  v -values 
of  interest: 

cdf (2/3) -cdf (1/3)  fR 
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(e)  The  variance  can  be  calculated  as  follows: 


1  n\ 

E(X)  =  /  x6(x  —  x2)dx  =  6  •  /  (x2  —  x2)dx 

fo  Jo 

1 1 


=  6 


1  3  1  4 

-X - X 

3  4 


Jo 

2- 


G-i) 


=  6.  ---1=0.5 


l  ri 

n  I  /„3  „4' 


E(X  )  =  /  x  6(x  —  x  )dx  =  6  •  /  (v  —  v  )dv 
f0  JO 

-ll 


=  6 


1  4  ^  5 

-V - V 

4  5 


Jo 


G-i) 


=  61  —  —  —  1=  0.3 


Var(X)  =  E(X2)  -  [E(X)]2  =  0.3  -  0.52  =  0.05. 


Solution  to  Exercise  7.4 


(a)  Two  conditions  need  to  be  satisfied  for  f(x)  to  be  a  proper  PDF: 


(i)  f2  f(x)dx  =  1 


fix)  dx  = 


■2  r-2 

c  -  x(2  —  x)dx  =  c  /  x(2  —  x)dx 

o  Jo 


=c  I  (2x  —  x2)dx  =  c 
f0 


2  1  3 

x - jr 

3 


~i  2 


Jo 


—C 


8 

4  -  -  -  (0  -  0) 


4  ! 

=  c  •  -  =  1 
3 


3 

*C=4' 


(ii)  /(*)>  0: 


fix)  =  -x(2  -  x)  >  0  V  x  e  [0,  2] 


(b)  We  calculate 


F(jc)  =  P(X  <  jc)  = 


X  f(t)dt=  (X-t(2-t)dt 
0  Jo  4 


3  rx  9  3  f  9  i  o"1* 

-  (2t  —  t2)dt  =  -  t2 - 1 3 

4  7o  4  [  3 


o 


3 

4 


x2  -  lx3  -  0 

3 


=  -v' 

4 


(■  -  H 
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and  therefore 


F(x)  = 


0 


if  x  <  0 


|x2(l  -  \x)  if  0  <  x  <  2 


if  2  <  x. 


(c)  The  expectation  is 


E(X )  = 


*2  3  r  2 

x/(x)dx  =  -  (2x2  —  x3)dx 

'o  4  Jo 


3 

2  n  l  a 

2  3 

r  2  i  i 

- x 3  -  -x4 

o 

1 

\Q 

1 

1 

oo 

1 

4 

[3  4 

o  4 

|_3  4  J 

3 

4 


16  12 


_  3  4  _ 
“4*3” 


Using  Var(X)  =  E(X2)  —  (E(X))2,  we  calculate  the  variance  as 

*2 


(d) 


3  f2 

E(X2)  =  /  x2/(v)dv  —  -  I  (2v3  —  v4)dv 

4  Jo 


3 

2  4  1  c 

2  3 

2  1 

-Jt4  -  -v5 

-  •  16 - 32-0 

4 

4  5 

o  4 

4  5 

^  3-32  ^  3-8 

=  6 - =6 - 

4-5  5 


6 

5 


6  ry  1 

Var(X)  =  -  -  l2  =  -. 


CT 


P(\X-fJL\  <  0.5)  —  = 


(h 


(0.5)- 


=  1  -0.8  =  0.2. 


Solution  to  Exercise  7.5 


(a)  The  marginal  distributions  are  obtained  by  the  row  and  column  sums  of  the  joint 
PDF,  respectively.  For  example,  P(X  =  1)  =  zf=,  Pij  =  P\+  =  1/4- 


Y 

II 

1 

1/6 

2 

7/12 

3 

1/4 

X 

P(X  =  Xi ) 

0 

3/4 

1 

1/4 
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The  marginal  distribution  of  X  tells  us  how  many  customers  sought  help  via  the 
telephone  hotline  (75  %)  and  via  email  (25  %).  The  marginal  distribution  of  Y 
represents  the  distribution  of  the  satisfaction  level,  highlighting  that  more  than 
half  of  the  customers  (7/12)  were  “satisfied”. 


(b)  To  determine  the  75  %  quantile  with  respect  to  Y,  we  need  to  find  the  value 
yo.75  for  which  F(y0j5 )  >  0.75  and  E(y)  <  0.75  for  y  <  yo.75*  T  takes  the 
values  1,  2,  3.  The  quantile  cannot  be  yo.75  =  1  because  E(l)  =  1/6  <  0.75. 
The  75  %  quantile  is  yo.75  =  2  because  F( 2)  =  1/6  +  7/12  =  3/4  >  0.75  and 
for  all  values  which  are  smaller  than  2  we  get  F(x)  <  0.75. 


(c)  We  can  calculate  the  conditional  distribution  using  P(Y  =  yj\X  =  1)  =  pij/ 
=  p\j/(l/6  +  1/12  +  0)  =  /?ij/(0.25).  Therefore, 

1/6  _  2 

1/4  ~  3’ 

1/12  _  1 

~  3’ 


P(Y  =  \\X  =  1)  = 


P(Y  =  2\X  =  1)  = 


P(Y  =  3\X  =  1)  = 


1/4 

0 

V4 


=  0. 


Among  those  who  used  the  email  customer  service  two-thirds  were  unsatisfied, 
one-third  were  satisfied,  and  no  one  was  very  satisfied. 


(d)  As  we  know  from  (7.27),  two  discrete  random  variables  are  said  to  be  indepen¬ 
dent  if  P(X  =  Xi,Y  =  yj)  =  P(X  =  Xi)P(Y  =  yj).  However,  in  our  example, 
P(X  =  0,  Y  =  1)  =  P(X  =  0 )P(X  =  1)  =  |  •  \  ^  0.  This  means  that  X  and 
Y  are  not  independent. 


(e)  The  covariance  of  X  and  Y  is  defined  as  Cov(X,  Y)  =  E(XT)  —  E(X)E(T).  We 
calculate 


3  1  1 

E(X)  =  0  •  -  +  1  •  -  =  - 

4  4  4 


E  (Y)  = 


1  7 

1  ■  -  +2  -  — 
6  12 


+  3-  - 


1 

4 


25 

U 


E  (XY)  =  01-0  +  1-1 


1 

7  +  0-2- 
6 


1 


+  1-2 


1 

12 


+  0-3 


1 

-  +  1-3-0 
4 


Cov(X,  Y) 


2 

6 

2  1  25 

6  _  4  '  12 


3 

16' 


Since  Cov(X,  Y)  <  0,  we  conclude  that  there  is  a  negative  relationship  between 
X  and  Y :  the  higher  the  values  of  X ,  the  lower  the  values  of  Y — and  vice 
versa.  For  example,  those  who  use  the  email-based  customer  service  (X  =  1) 
are  less  satisfied  than  those  who  use  the  telephone  customer  service  ( X  =  0). 
It  is,  however,  evident  that  in  this  example  the  values  of  X  have  no  order  and 
therefore,  care  must  be  exercised  in  interpreting  the  covariance. 
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Solution  to  Exercise  7.6  Using  Tschebyschev’s  inequality  (7.24) 


Var(X) 

PQX  -  /i\  <  c)  >  0.9  =  1 - 

we  can  determine  c  as  follows: 


1  - 


Var(X) 


=  0.9 


c  = 


c  = 


Var(X)  4 


0.1  0.1 

±n/40  =  ±6.325, 


=  40 


Thus,  the  interval  is  [15  —  6.325;  15  +  6.325]  =  [8.675;  21.325] 


Solution  to  Exercise  7. 7 


(a)  The  joint  PDF  is: 


Y 

0  1  2 

-1 
X  2 

0.3  0.2  0.2 

0.1  0.1  0.1 

(b)  The  marginal  distributions  are  obtained  from  the  row  and  column  sums  of  the 
joint  PDF,  respectively: 


Y 

0  1 

2 

P(Y  =  y ) 

0.4  0.3 

0.3 

-1  2 

>< 

II 

0.7  0.3 

(c)  The  random  variables  X  and  Y  are  independent  if 

P(X  =  x,Y  =  y)  =  P{X  =  x)P(Y  =  y)  Vx,y. 

However,  in  our  example  we  have,  for  example, 

P(X  =  -1,  Y  =  0)  =  0.3  ^  P(X  =  -1)  •  P(Y  =  0)  =  0.7  •  0.4  =  0.28. 
Hence,  the  two  variables  are  not  independent. 

(d)  The  joint  distribution  of  X  and  Y  can  be  used  to  obtain  the  desired  distribution 
of  U .  For  example,  If  X  =  —1  and  Y  =  0,  then  U  =  X  -\-Y  =  —  1 .  The  respec¬ 
tive  probability  is  P(U  =  —  1 )  =  0.3  because  P(U  =  —  1)  =  P(X  =  — 1,7  = 
0)  =  0.3  and  there  is  no  other  combination  of  X-  and  Y- values  which  yields 
X  +  Y  =  —  1.  The  distribution  of  U  is  therefore  as  follows: 
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k  -10 

1  2 

3 

4 

P(U  =  k )  0.3  0.2 

0.2  0.1 

0.1 

0.1 

(e)  We  calculate 


4 

E (U)  =  X  k'  P{V  =  =  °-8 

k=- 1 

E(X)  =  (-1)0.7  +  2  •  0.3  =  -0.1 
E(F)  =  0  •  0.4  +  1  •  0.3  +  2  •  0.3  =  0.9 


E (U2)  =  0.3  •  (-1)2  +  •  •  •  +  0.1  •  42  =  3.4 
E(X2)  =  0.7  •  (-1)2  +  0.3  •  22  =  1.9 
E(E2)  =  0.4  •  02  +  0.3  •  l2  +  0.3  •  22  =  1.5 

Var({/)  =  E (U2)  -  [E(C/>]2  =  3.4  -  (0.8)2  =  2.76 
Var(X)  =  E(X2)  -  [E(X)]2  =  1.9  -  (-0.1)2  =  1.89 
Var(E)  =  E(E2)  -  [E(E)]2  =  1.5  -  (0.9)2  =  0.69. 


It  can  be  seen  that  E(X)  +  E(y)  =  —0.1  +  0.9  =  0.8  =  E (U).  This  makes 
sense  because  we  know  from  (7.31)  that  E(X  +  T)  =  E(X)  +  E(y).  However, 
Var (U)  =  2.76  /  Var(X)  +  Var(T)  =  1.89  +  0.69.  This  follows  from  (7.7.1) 
which  says  that  Var(X  zb  7)  =  Var(X)  +  Var(T)  ±  2Cov(X,  Y)  and  therefore, 
Var(X  ±  Y)  =  Var(X)  +  Var(T)  only  if  the  covariance  is  0.  We  know  from  (c) 
that  X  and  Y  are  not  independent  and  thus  Cov(X,  Y)  ^  0. 


Solution  to  Exercise  7.8 


(a)  The  random  variables  X  and  Y  are  independent  if  the  balls  are  drawn  with 
replacement.  This  becomes  clear  by  understanding  that  drawing  with  replace¬ 
ment  implies  that  for  both  the  draws,  the  same  balls  are  in  the  urn  and  the 
conditions  in  each  draw  remain  the  same.  The  first  draw  has  no  implications  for 
the  second  draw. 

If  we  were  drawing  the  balls  without  replacement,  then  the  first  draw  could  possibly 
have  implications  for  the  second  draw:  for  instance,  if  the  first  ball  drawn  was  red, 
then  the  second  one  could  not  be  red  because  there  is  only  one  red  ball  in  the  urn. 
This  means  that  drawing  without  replacement  implies  dependency  of  X  and  Y.  This 
can  also  be  seen  by  evaluating  the  independence  assumption  (7.27): 

1  1 

P(X  =  2,  Y  =  2)  =  0  /  P(X  =  2)  •  P(Y  =  2)  = - . 
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(b)  The  marginal  probabilities  P(X  =  Xf)  can  be  obtained  from  the  given  informa¬ 
tion.  For  example,  3  out  of  8  balls  are  black  and  thus  P(X  =  1)  =  3/8.  The 
conditional  distributions  P(Y \X  =  Xi)  can  be  calculated  easily  by  realizing  that 
under  the  assumed  dependency  of  X  and  Y,  the  second  draw  is  always  based  on 
7  balls  (8  balls  minus  the  one  drawn  under  the  condition  X  =  x;) — e.g.  if  the 
first  ball  drawn  is  black,  then  7  balls,  2  of  which  are  black,  remain  in  the  urn  and 
P(Y  =  1\X  =  1)  =  2/7.  We  thus  calculate 

2  3  6 

P(Y  =  1,  X  =  1  )  =  P(Y  =  1\X  =  1  )P(X  =  1)  =  -  •  -  =  — 

P(Y  =  1,X  =  2  )  =  P(Y  =  1\X  =  2  )P(X  =  2)  =  l.l-  =  l- 

P (Y  =  3,  X  =  3)  =  P(Y  =  3\X  =  3 )P(X  =  3)  =  ^ 

and  obtain 


Y 

1 

2 

3 

1 

6 

3 

12 

1 

56 

56 

56 

X  2 

3 

56 

0 

4 

56 

12 

4 

12 

56 

56 

56 

(c)  The  expectations  are 

3  l  4  17 

E(X)  =  1  •  — 1-2  -  — |-  3  •  -  =  — 

8  8  8  8 

E(7)  =  E(X)  =  y. 

To  estimate  p(X,  T),  we  need  Cov(X,  Y)  as  well  as  Var(X)  and  Var(T): 


6  3  12  3  4  12  4  12 

E  (XT)  =  1—  +2—  +3—  +2—  +4-0  +  6—  +  3—  +6—  +  9  — 

56  56  56  56  56  56  56  56 

246 


56 

9  9  ^  3  9I  94  43 

E(X2)  =  E  (Y2)  =  l2-  +  22-  +  32-  =  — 

Va,(X)  =  EH2)  -  [EH)]2  =  f  -  (y)2  =  | 


55 

Var(T)  =  Var(X)  =  — 
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Using  (7.38)  and  Cov(X,  Y)  =  E (XY)  —  E(X)E(y),  we  obtain 

246  289 

:  56  64  =  -0.143. 


Cov(X,  Y) 


P  = 


VVar(X)  •  Var(Y) 


55  55 

64  '  64 


Solution  to  Exercise  7.9 


(a)  The  constant  c  must  satisfy 


100  ,100  /  ioo  —  x 


'40  J 10 
and  therefore 


/ 100  —  x\ 
c  I  — - - Jdxdy  = 


100  ,100  100c  , 

I - cdx  dy  =  \ 

r40  7l0  X 


100 


[100c  ln(x)  —  cx][q°  dy  —  I  100c  In  100  -  100c  -  100c ln(10)  +  10c dy 

r40  J 40 

which  is 


100 


100  /  ioo 

100c  ( In  —  -  1  + 

'40  V 


10 


T)  dy  =  [100cy  On  10-9/10)11!! 


=  600c(101n  10  —  9)  c  %  0.00012. 


(b)  The  marginal  distribution  is 


, 100  / ioo  —  x\  r  / ioo  —  x\  i 

ft(i)  =  l  c  (~H =  c  (— H  A 


100 


40 


/ 100  —  x\  / 100  —  x\  /100-JC 

=  100c  (  — - - j  -  40c  (  — - - J  «  0.00713 


for  10  <  x  <  100. 


(c)  To  determine  P(X  >  75),  we  need  the  cumulative  marginal  distribution  of  X: 


Fx(x)=  I  fx(t)dt=  f  0.00713 
— oo  J 10 

•*  0.00713 


-  0.00713  d?  =  [0.713  ln(f)  -  0.00713] 


10 


10  t 

=  0.713  ln(jc)  -  0.00713v  -  0.00713  ln(10)  +  0.00713  •  10. 


Now  we  can  calculate 


P(X  >  75)  =  1  -  P(X  <  75)  =  1  -  Fx( 75)  =  1  -  0.973  ^  2.7  %. 
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(d)  The  conditional  distribution  is 


fy\x(x,y) 


fix,  y) 
fix) 


c(i0CL-v) 

60c  (%) 


1 

60' 


Solution  to  Exercise  7.10  If  we  evaluate  the  expectation  with  respect  to  Y,  then  both 
H  and  a  can  be  considered  to  be  constants.  We  can  therefore  write 

E(F)  =  E  f  -  V)- 

Since  E(X)  =  /i,  it  follows  that  E(Y)  =  0.  The  variance  is 

(X-ii 
Var (F)  =  Var  ( - - 

Applying  Var  (a  +  bX)  =  b2 Var(X)  to  this  equation  yields  a  =  /i,  b  =  ^  and  there¬ 
fore 

1  a2 

Var  (Y)  =  -rVar(X)  =  =  1. 

c JZ  (JA 


Chapter  8 


Solution  to  Exercise  8. 1  The  random  variable  X:  “number  of  packages  with  a  toy”  is 
binomially  distributed.  In  each  of  n  =  20  “trials”,  a  toy  can  be  found  with  probability 


(a)  We  thus  get 


P(X  =  4) 


(b)  Similarly,  we  calculate 


P(X  =  0) 


(c)  This  question  relates  to  a  hypergeometric  distribution:  there  are  N  =  20  pack¬ 
ages  with  M  =  3  packages  with  toys  and  N  —  M  =  17  packages  without  a  toy. 
The  daughter  gets  n  =  5  packages  and  we  are  interested  in  P  (X  =  2) .  Hence, 
we  get 


P(X  =  2) 


N 


20 


0.13. 
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Solution  to  Exercise  8.2  Given  X  ~  Af(42.1,  20. 82),  we  get: 


P(X  >  50)  =  1  -  P(X  <  50)  =  1  - 
=  1  -  0(0.37)  «  0.35. 


( 


50-42.1 

20.8 


) 


We  obtain  the  same  results  in  R  as  follows: 
l-pnorm (50 ,42.1,20.8) 


#(30  <  X  <  40)  =  P(X  <  40)  -  P(X  <  30) 

40-  42.1  \  /  30  — 


/  40  —  42.1  \  /  30  —  42.1  \ 

V  20.8  /  ^  V  20.8  / 


20.8  /  r\  20. 

=  0(— 0.096)  -  0(— 0.577)  =  1  -  0.538  -  1  +  0.718 


18%. 


We  would  have  obtained  the  same  results  in  R  using: 


pnorm(40,42. l,20.8)-pnorm(30,42. 1,20.8) 


IT 


i 


Solution  to  Exercise  8.3  The  random  variable  X  follows  a  discrete  uniform  distrib¬ 
ution  because  pt  =  f°r  each  xi  •  The  expectation  and  variance  are  therefore 


E(X)  = 


Var(X)  = 


k+\  12  +  1 


1  9 

—  (122  -  1)  »  11.92. 
12 


Solution  to  Exercise  8.4  Each  guess  is  a  Bernoulli  experiment  where  the  right 
answer  is  given  with  a  probability  of  50  %.  The  number  of  correct  guesses  therefore 
follows  a  binomial  distribution,  i.e.  X  ~  #(10;  0.5).  The  probability  of  giving  the 
right  answer  at  least  8  times  is  identical  to  the  probability  of  not  being  wrong  more 
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than  2  times.  We  can  thus  calculate  P(X  >  8)  as  P(X  <  2): 

P(X  =  0)  =  ( 

1o°j0.5°(l  -  0.5)10  %  0.000977 

P(X=l)  = 

PAoVd  -  0.5)9  %  0.009766 

P(X  =  2)  =  | 

r°  J0.52(l  -  0.5)8  0.043945. 

This  relates  to 

P(X  <  2)  =  P(X  =  0)  +  P(X  =  1)  +  P(X  =  2) 

=  0.000977  +  0.009766  +  0.043945  «  0.0547. 

We  would  have  obtained  the  same  results  in  R  using: 

pbinom(2 ,10,0.5) 
l-pbinom(7 ,10,0.5) 

R 

Solution  to  Exercise  8.5 


(a)  It  seems  appropriate  to  model  the  number  of  fused  bulbs  with  a  Poisson  dis¬ 
tribution.  We  assume,  however,  that  the  probabilities  of  fused  bulbs  on  two 
consecutive  days  are  independent  of  each  other;  i.e.  they  only  depend  on  A  but 
not  on  the  time  t. 

(b)  The  arithmetic  mean  is 

1  52 

x  =  —(0  +  l-  8  +  2-  8  +  --  -  +  5-l)  =  —  =  1.7333 
30  30 

which  means  that,  on  an  average,  1.73  bulbs  are  fused  per  day.  The  variance  is 

s2  =  —  (0  +  l2  •  8  +  22  •  8  +  •  •  •  +  52  •  1)  -  1.73332 
30 

142 

= -  3.0044  =  1.72889. 

30 

We  see  that  mean  and  variance  are  similar,  which  is  an  indication  that  the 
choice  of  a  Poisson  distribution  is  appropriate  since  we  assume  E(X)  =  A  and 
Var(X)  =  A. 

(c)  The  following  table  lists  the  proportions  (i.e.  relative  frequencies  /))  together 
with  the  probabilities  P(X  =  x)  from  a  Po(  1.7 3) -distribution.  As  a  reference, 
we  also  list  the  probabilities  from  a  Po (2) -distribution  since  it  is  not  practically 
possible  that  1.73  bulbs  stop  working  and  it  may  hence  be  an  option  to  round 
the  mean. 
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fi 

Po(  1.73) 

Po(  2) 

P(X  =  0) 

0.2 

0.177 

0.135 

P(X  =  1) 

0.267 

0.307 

0.27 

P(X  =  2) 

0.267 

0.265 

0.27 

P(X  =  3) 

0.167 

0.153 

0.18 

P(X  =  4) 

0.067 

0.067 

0.09 

P{X  =  5) 

0.033 

0.023 

0.036 

One  can  see  that  observed  proportions  and  expected  probabilities  are  close 
together  which  indicates  again  that  the  choice  of  a  Poisson  distribution  was 
appropriate.  Chapter  9  gives  more  details  on  how  to  estimate  parameters,  such 
as  A,  from  data  if  it  is  unknown. 


(d)  Using  A  =  1.73,  we  calculate 


A* 


P(X  >  5)  =  1  -  P(X  <  5)  =  1  -  £  -exp(-A) 

I  • 


=  1  —  exp(— 1.73) 


=  1-0.99  =  0.01 


i= o 

73°  1 .73 1 

+  —  + 


0! 


1! 


+ 


Thus,  the  bulbs  are  replaced  on  only  1  %  of  the  days. 

(e)  If  X  follows  a  Poisson  distribution  then,  given  Theorem  8.2.1,  Y  follows  an 
exponential  distribution  with  A  =  1.73. 


(f)  The  expectation  of  an  exponentially  distributed  variable  is 


E(Y)  = 


1 

A 


1 

L73 


0.578. 


This  means  that,  on  average,  it  takes  more  than  half  a  day  until  one  of  the  bulbs 
gets  fused. 


Solution  to  Exercise  8.6 


(a)  Let  X  be  a  random  variable  describing  “the  number  v  of  winning  tickets 
among  n  bought  tickets”;  then  X  follows  the  hypergeometric  distribution 
X  ~  H(n,  500,  4000).  We  need  to  determine  n  for  the  conditions  specified. 
We  are  interested  in 


P(X  >  3)  =  1  -  P(X  =  2)-  P(X  =  1)  -  P(X  =  0). 


Using  the  PMF  of  the  hypergeometric  distribution 


P(X  =x) 


N 


9 
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this  equates  to 

/500\  /4000— 500\  /500\  /4000-500\  /500\  /4000-500 

p(y  ^  ^  _  i  V  2  /  V  n-2  )  V  1  A  n-l  )  U  /l  n 

_3)  1  (4000)  (4000)  (4000) 

We  have  the  following  requirement: 


(500)  (4000-500)  (500)  (4000-500) 

(4000)  (4000) 


/500\  /4000-500 
V  0  /  V  n 

(T) 


)  ! 

-  >  0.99. 


To  solve  this  equation,  we  can  program  this  function  for  P(X  >  3;  n)  in  R  and 
evaluate  it  for  different  numbers  of  tickets  sold,  e.g.  between  50  and  100  tickets: 


raffle  <-  function(n){ 

p  <-  l-((choose(500,2)*choose(3500,n-2))/(choose(4000,n))) 
- ( (choose (500 , 1) *choose (3500 , n-l) ) / (choose (4000 ,n) ) ) 

- ( (choose (500 , 0) *choose (3500 ,n) ) / (choose (4000 ,n) ) ) 
return(p) 


} 

raffle (50 : 100) 
raffle (63: 64) 


The  output  shows  that  at  least  64  tickets  need  to  be  bought  to  have  a  99% 
guarantee  that  at  least  three  tickets  win.  This  equates  to  spending  €  96. 

(b)  We  can  plot  the  function  as  follows: 

nb  <-  seq(l:75) 

plot (nb , tombola(nb) , type= ' 1 ' ) 


Figure  B.16  shows  the  relationship  between  the  number  of  tickets  bought  and 
the  probability  of  having  at  least  three  winning  tickets. 

(c)  The  solution  of  (a)  shows  that  it  is  well  worth  taking  part  in  the  raffle:  Marco 
pays  €96  and  with  a  probability  of  99  %  and  he  wins  at  least  three  prizes  which 
are  worth  €142  •  3  =  426.  More  generally,  the  money  generated  by  the  raffle 
is  €1.50  x  4000  =  6000,  but  the  prizes  are  worth  €142  •  500  =  71,  000.  One 
may  suspect  that  the  company  produces  the  appliances  much  more  cheaply  than 
they  are  sold  for  and  is  thus  so  generous. 
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Fig.  B.1 6  Probability  to  have  at  least  three  winning  tickets  given  the  number  of  tickets  bought 


Solution  to  Exercise  8.7  The  probability  of  getting  a  girl  is  p  =  1—0.5122  = 
0.4878. 


(a)  We  are  dealing  with  a  geometric  distribution  here.  Since  we  are  interested  in 
P(X  <  3),  we  can  calculate: 

P(X  =  1)  =  0.4878 

P(X  =  2)  =  0.4878(1  -  0.4878)  =  0.2498512 

P(X  =  3)  =  0.4878(1  -  0.4878)2  =  0.1279738 

P(X  <  3)  =  P(X  =  1)  +  P(X  =  2)  +  P(X  =  3)  =  0.865625. 

We  would  have  obtained  the  same  result  in  R  using: 
pgeom(2 ,0 .4878) 


Note  that  we  have  to  specify  “2”  rather  than  “3”  for  v  because  R  takes  the  number 
of  unsuccessful  trials  rather  the  number  of  trials  until  success. 


(b)  Here  we  deal  with  a  binomial  distribution  with  k  =  2  and  n  —  4.  We  can  calculate 
P{X  =  2)  as  follows: 


P(X  =  k ) 


Pk(  i  -  p)n~k 
0.48782  •  (1  - 


0.4878)2  =  0.3745536. 


R  would  have  given  us  the  same  result  using  the  dbinom(2 ,4 , 0 . 4878)  com¬ 
mand. 
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Solution  to  Exercise  8.8 


(a)  The  random  variable  Y  follows  a  Poisson  distribution,  see  Theorem  8.2.1  for 
more  details. 

(b)  The  fisherman  catches,  on  average,  3  fish  an  hour.  We  can  thus  assume  that  the 
rate  A  is  3  and  thus  E(T)  =  A  =  3.  Similarly,  E(X)  =  j  =  ^  which  means  that 
it  takes,  on  average,  20  min  to  catch  another  fish. 

(c)  Using  the  PDF  of  the  Poisson  distribution,  we  get: 

35 

P(Y  =  5)  =  —  exp(-3)  =  0.1  =  10% 

5 . 

3° 

P(Y  <  1)  =  P(Y  =  0)  =  —  exp (—3)  0.0498  «  5  %. 

0 . 

We  would  have  obtained  the  same  results  in  R  using  the  dpois(5,3)  and 
dpois(0,3)  commands. 


Solution  to  Exercise  8.9  The  random  variable  X  =  “choice  of  dessert”  follows  a 
multinomial  distribution.  More  precisely,  X\  describes  whether  chocolate  brownies 
were  chosen,  X2  describes  whether  yoghurt  was  chosen,  X3  describes  whether  lemon 
tart  was  chosen  and  X  =  {X\,  X2,  X3}. 


(a)  Using  the  PMF  of  the  multinomial  distribution,  we  get 

n ! 

P(X  1  =  n  1,  X2  =  n2,  . . . ,  Xk  =  nk)  = 


n\\n2\ 


nk, 


n  1  ilk 

■Pi  ■■■Pk 


P(Xl=2,  X2  =  1,X3=2) 


— —  •  0.22  •  0.31  •  0.52 
2!  1  !2! 

9%. 


We  would  have  obtained  the  same  results  in  R  as  follows: 


dmultinom(c (2 , 1 , 2) ,prob=c (0 . 2 , 0 . 3 , 0 . 5) ) 


(b)  The  probability  of  choosing  lemon  tart  for  the  first  two  guests  is  1 .  We  thus  need 
to  determine  the  probability  that  3  out  of  the  remaining  3  guests  order  lemon 
tart: 

3! 

P(X  1  =  0,  X2  =  0,  X3  =  3)  =  — —  •  0.2°  .  0.3°  •  0.53 

0!0!3! 

=  12.5%. 

Using  dmultinom  (c  (0 ,0,3)  ,prob=c  (0.2, 0.3, 0.5))  in  R,  we  get  the  same 
result. 
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(c)  The  expectation  vector  is 


E(X)  =  (np i, . . . ,  npk )  =  (20  •  0.2,  20  •  3,  20  •  0.5)  =  (4,  6,  10). 


This  means  we  expect  4  guests  to  order  brownies,  6  to  order  yoghurt,  and  10  to 
order  lemon  tart.  The  covariance  matrix  can  be  determined  as  follows: 


Co  v(Xi,Xj) 


npj(l  -  pi)  if/ =7 
-npipj  if  /  ^  j. 


Using  n  =  20,  p\  =  0.2,  y>2  =  0.3  and  y>3  =  0.5,  we  obtain  the  covariance 
matrix  as: 


/  3.2  -1.2  -2\ 
-1.2  4.2  -3 

V  -2  -3  5  ) 


Solution  to  Exercise  8.10 


P(S  in=P'  P(S  >  1)  •  P(W  >  1) 

=  (1  -  P(S  =  0))  •  (1  -  P(W 


=  0)) 


=  (1  -e~3—)  •  (1  -  e 


i0 


-4 


|0 


0! 


0! 


) 


0.93. 


Solution  to  Exercise  8.11 


(a)  Random  numbers  of  a  normal  distribution  can  be  generated  using  the  rnorm 
command.  By  default  p  =  0  and  a  =  1  (see  ?rnorm),  so  we  do  not  need  to 
specify  these  parameters.  We  simply  need  to  set  n  =  1000.  The  mean  of  the 
1000  realizations  can  thus  be  obtained  using  mean  (rnorm  ( 1000) ) .  We  can,  for 
example,  write  a  for  loop  to  repeat  this  process  1000  times.  An  empty  (=NA) 
vector  of  size  1000  can  be  used  to  store  and  evaluate  the  results: 

set . seed(24121980) 

R  <-  1000 

means  <-  c(rep(NA,R)) 

for(i  in  1 : R) {means [i]  <-  mean (rnorm ( 1000) ) } 
mean (means) 

[1]  -0.0007616465 
var (means) 

[1]  0.0009671311 
plot (density (means) ) 


We  see  that  the  mean  of  the  arithmetic  means  is  close  to  zero,  but  not  exactly  zero. 
The  variance  is  approximately  a2  jn  =  1/1000  =  0.001,  as  one  would  expect 
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Fig.  B.17  Kernel  density 
plot  of  the  distribution 
simulated  in  Exercise  8.11a 


from  the  Central  Limit  Theorem.  The  distribution  is  symmetric,  similar  to  a 

2 

normal  distribution,  see  Fig.  B.17.  It  follows  that  Xn  is  approximately  N(fi,  q-) 
distributed,  as  one  could  expect  from  the  Theorem  of  Large  Numbers  and  the 
Central  Limit  Theorem.  It  is  important  to  understand  that  X  is  not  fixed  but  a 
random  variable  which  follows  a  distribution,  i.e.  the  normal  distribution. 

(b)  We  can  use  the  same  code  as  above,  except  we  use  the  exponential  instead  of 
the  normal  distribution: 


means2  <-  c(rep(NA,R)) 

for(i  in  1 : R) {means2 [i]  <-  mean(rexp(1000) )} 
mean (means 2) 

[1]  1.001321 
var (means2) 

[1]  0.001056113 
plot (density (means) ) 


The  realizations  are  i.i.d.  observations.  Once  can  see  that,  as  in  a),  Xn  is  approx- 

2 

imately  N(fi,  ^-)  =  7V(1 ,  1/1000)  distributed.  It  is  evident  that  the  X/  do  not 
necessarily  need  to  follow  a  normal  distribution  for  X  to  follow  a  normal  distri¬ 
bution,  see  also  Fig.  B.18a. 

(c)  Increasing  the  number  of  repetitions  makes  the  distribution  look  closer  to  a 

normal  distribution,  see  Fig.  B.18b.  This  visualizes  that  as  n  tends  to  infinity  Xn 

2 

gets  closer  to  a  N(fi,  ^-) -distribution. 
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(a)  1,000  repetitions 


Fig.B.18  Kernel  density  plots  for  Exercises  8.1  lb  and  8.11c 


Chapter  9 


Solution  to  Exercise  9. 1 


(a)  The  exercise  tells  us  that  Ay  ~  Po(X),  i  =  1,2 , ,n.  Let  us  look  at  the  real¬ 
izations  x\ ,  X2,  •  •  • ,  xn:  under  the  assumption  of  independence,  which  we  know 
is  fulfilled  because  the  Ay’s  are  i.i.d.,  and  we  can  write  the  likelihood  function 
as  the  product  of  the  n  PMF’s  of  the  Poisson  distribution: 


n 


n 


L(0;x)  =  L  f :  0)  =  H 


\Xi 


-A 


i= 1 


i= 1 


• ! 


It  is  better  to  work  on  a  log- scale  because  it  is  easy  to  differentiate.  The  results 
are  identical  no  matter  whether  we  use  the  likelihood  function  or  the  log- 
likelihood  function  because  the  log  transformation  is  monotone  in  nature.  The 
log-likelihood  function  is: 


In  L  =  Xi  In  A 


ln(xi!  •  •  •  xn\)  —  n A. 


Differentiating  with  respect  to  A  yields 


9  In  L 

d\ 

which  gives  us  the  ML  estimate: 


i 

Xi  —  n  =  0 


Xi  =  x. 


We  need  to  confirm  that  the  second  derivative  is  <  0  at  A  =  x;  otherwise,  the 
solution  would  be  a  minimum  rather  than  a  maximum.  We  get 


d2  In  L  1  ^ 

~axr  =  “v  L 


n 
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It  follows  that  the  arithmetic  mean  x  =  A  is  the  maximum  likelihood  estimator 
for  the  parameter  A  of  a  Poisson  distribution. 

(b)  Using  the  results  from  (a)  we  can  write  the  log-likelihood  function  for  x\  = 
4,  X2  =  3,  X3  =  8,  X4  =  6,  X5  =  6  as: 

In  L  =  27  In  A  -  ln(4!  3!  8!  6!  6!)  -  5A. 

because  =  27.  We  can  write  down  this  function  in  R  as  follows: 


MLP  <-  funct ion ( lambda) { 

27*log (lambda)  -  log(f actorial (4) * . . . *f actorial (6) )  - 
5* lambda 
} 


The  function  can  be  plotted  using  the  curve  command: 


curve (MLP,  from=0,  to=10) 


Figure  B.19  shows  the  log-likelihood  function.  It  can  be  seen  that  the  function 
reaches  its  maximum  at  A  =  5.4. 


(c)  Using  (a)  we  can  write  the  likelihood  function  as 


n 


n 


XXi 


L(0;  x)  =  FI  f(xi\  9)  =  ]~7  —  e  A  = 

A  ±  A  A  Xi ! 


i  —  \ 


i  =  1 


n  xi- 


-n  A 


= 


-n\ 


1 


“V 


g(t,  A) 


IU2 

h(x 


Fig.  B.19  Illustration  of  the 
log-likelihood  function 
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This  means  T  =  X/=i  xi  is  sufficient  for  A.  The  arithmetic  mean,  which  is 
the  maximum  likelihood  estimate,  is  a  one-to-one  function  of  T  and  therefore 
sufficient  too. 


Solution  to  Exercise  9.2 

(a)  The  probability  density  function  of  a  normal  distribution  equates  to 

...  1  (  (x-n)2\ 


f(x)  =  — —  exp 

(TV  27T 


2a2 


withoo  <  v  <oo,—  oo</i<oo,  a2  >  0.  The  likelihood  function  is  therefore 


L(x \,X2,  ...,xn\fi,a2)  = 


(vb)  exp(_l 


(xj  -  n)‘ 

2a2 


To  find  the  maximum  of  L(x\ ,  X2,  . . . ,  xn  |/i,  a2),  it  is  again  easier  to  work  with 
the  /<?g-likelihood  function,  which  is 


2  n  n  2  v  i  ( (xi  /x)J 

l  =  \nL(xi,x2,  ...,xn\fjL,  a  )  =  --ln27r-  -  In  cr  I — 

/— l  ' 


Assuming  a2  to  be  1,  differentiating  the  log-likelihood  function  with  respect  to 
/i,  and  equating  it  to  zero  gives  us 


=  2 


iM=°  ^ 

i  =  1  V  7  i  =  \ 


The  ML  estimate  is  therefore  jl  =  x. 

(b)  Looking  at  the  differentiated  log-likelihood  function  in  (a)  shows  us  that  the  ML 
estimate  of  /i  is  always  the  arithmetic  mean,  no  matter  whether  a2  is  1  or  any 
other  number. 

(c)  Differentiating  the  log-likelihood  function  from  (a)  with  respect  to  a2  yields 


dl 

da2 


n  1 


2  a2  2a 


1  n 

+  77X  (*/  -  M)2  =  O' 


_  07 

Using  jl  =  x  we  calculate  -pr  =  0 


■2  =  -  yy.v,  -  A)2  =  -  y'.ixj  -  *)2- 

m  •  •  vi  •  * 


Since  the  parameter  0  =  (/i,  a2)  is  two-dimensional,  it  follows  that  one  needs  to 
solve  two  ML  equations,  where  /)ml  is  estimated  first,  and  a^L  second  (as  we 
did  above).  It  follows  further  that  one  needs  to  look  at  the  positive  definiteness 
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of  a  matrix  (the  so-called  information  matrix)  when  checking  that  the  second- 
order  derivatives  of  the  estimates  are  positive  and  therefore  the  estimates  yield 
a  maximum  rather  then  a  minimum.  However,  we  omit  this  lengthy  and  time- 
consuming  task  here. 


Solution  to  Exercise  9.3  The  probability  density  function  of  U(0,  6)  is 

fix)  =  -  if  0  <  x  <  6  and  0  otherwise. 

J  0 

Note  that  this  equates  to  the  PDF  from  Definition  8.2.1  for  a  =  0  and  b  =  0.  The 
likelihood  function  is  therefore 


L(x i,x2,  ...,x„\0)  = 


G) 


n 


if  0  <  Xi  <  0  and  0  otherwise. 


One  can  see  that  L{x\ ,  X2, . . . ,  xn \6)  increases  as  6  decreases.  The  maximum  of 
the  likelihood  function  is  therefore  achieved  for  the  smallest  valid  6.  In  particular,  0  is 
minimized  when  0  >  max  (x\,  X2,  . . . ,  xn)  =  X(n).  This  follows  from  the  definition 
of  the  PDF  which  requires  that  0  <  xi  <9  and  therefore  9  >  v/ .  Thus,  the  maximum 
likelihood  estimate  of  0  is  X(w),  the  greatest  observed  value  in  the  sample. 


Solution  to  Exercise  9.4 


(a)  Tn(X)  is  unbiased,  and  therefore  also  asymptotically  unbiased,  because 

(7  29)  1  1 

E(Tn(X))  =  E(nXm[n)  {  =  )n—  =  -=^. 

n\  A 

Similarly,  Vn(X)  is  unbiased,  and  therefore  also  asymptotically  unbiased, 
because 


E(Vn(X))  =  E 


(129)  1 
n 


XE(X‘> 

i  =  1 


1  1 

-n-  =  [i. 
n  \ 


(b)  To  calculate  the  MSE  we  need  to  determine  the  bias  and  the  variance  of  the 
estimators  as  we  know  from  Eq.  (9.5).  It  follows  from  (a)  that  both  estimators 
are  unbiased  and  hence  the  bias  is  0.  For  the  variances  we  get: 


Var(7 yn(X))  =  Var(nXmin) 


(723)  2xj  ,v  \  _  2 

—  n  Var(Xmin)  —  n 


1 


n2  A2 


Var(I4(X))  =  Var 


(7.33) 


2^Var(X,-) 

i=\ 


1  1 

^ H A^ 


n 
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Since  the  mean  squared  error  consists  of  the  sum  of  the  variance  and  squared 
bias,  the  MSE  for  Tn(X)  and  Vn(X)  are  fr  and  fi2,  respectively.  One  can 
see  that  the  larger  n ,  the  more  superior  Vn(X)  over  Tn(X)  in  terms  of  the  mean 
squared  error.  In  other  words,  Vn(X)  is  more  efficient  than  Tn(X)  because  its 
variance  is  lower  for  any  n  >  1 . 


(c)  Using  the  results  from  (b),  we  get 

1  ? 

lim  MSE(V„(X))  =  lim  -/?  =  0. 

n^oo  n^oo  n 


This  means  the  MSE  approaches  0  as  n  tends  to  infinity.  Therefore,  Vn  ( X )  is  MSE 
consistent  for  /i.  Since  Vn(X)  is  MSE  consistent,  it  is  also  weakly  consistent. 


Solution  to  Exercise  9.5 


(a)  The  point  estimate  of  /i  is  v  which  is 


1 

—  (450  +  •  •  •  +  790)  =  667.92. 


The  variance  of  a2  can  be  estimated  unbiasedly  using  s2: 


1 


yVy,  -  x)2 
i  —  \ 


l  ? 

=  —((450  -  667. 92)2  +  •  •  •  +  (790 
23 


667. 92)2)  %  18, 035. 


(b)  The  variance  is  unknown  and  needs  to  be  estimated.  We  thus  need  the  t- 
distribution  to  construct  the  confidence  interval.  We  can  determine  ^23; 0.975  ~ 
2.07  using  qt  (0.975, 23)  or  Table  C.2  (though  the  latter  is  not  detailed  enough), 
a  =  0.05,  v  =  667.97  and  a2  =  18,  035.  This  yields 


Il(X )  =  X  —  tn- l;l-a/2  • 
—  X  +  tji —  1 ;  1  — ck/2  ■ 


s  Vl8,035  ^ 

=  667.92  —  t2 3-0  975  ■ - - — —  ~  611.17, 

’  V24 

j  V18,035 

667.92  —  t2 3-0  975  ■ - 7= —  ^  724.66. 

’  V24 


The  confidence  interval  for  /i  is  thus  [611.17;  724.66]. 


(c)  We  can  reproduce  these  results  in  R  as  follows: 

eland  <-  c (450, 730, 700, 600, 620, ,790) 
t . test (eland) $conf . int 


Appendix  B:  Solutions  to  Exercises 


389 


Solution  to  Exercise  9.6 


Let  us  start  with  the  confidence  interval  for  the  “Brose  Baskets  Bamberg”.  Using 
^15; 0.975  =  2.13  1  4  (qt  (0 . 975 , 15)  or  Table  C. 2)  and  a  =  0.05,  we  can  determine 
the  confidence  interval  as  follows: 

s  ^  7.047  „  „ 

Il(Ba)  =  X  —  U  — 1;  1— a/2  •  — P  =  199.06  —  ^15;0.975  ‘  > —  =  195.305, 

*Jn  V16 

j  ^  7.047 

lu(Ba)  =  X  +  tn—\\  \—a/2  '  —  =  199.06  +  L5;0.975  •  > —  =  202.815. 

V16 


Thus,  we  get  [195.305;  202.815]. 

For  the  “Bayer  Giants  Leverkusen”,  we  use  /33;o.975  =  2.1604  to  get 

j  ^  9.782 

7/(L)  =  V  -  tn- l-l-a/2  •  — P  =  196  ^13;0.975  ‘  t —  =  190.352, 

\/n  V14 

j  ,  9.782 

4(L)  =  V  +  tn- l-l-a/2  •  — P  =  196  +  ^13;0.975  ‘  t —  =  201.648. 

\/n  V14 

This  leads  to  a  confidence  interval  of  [190.352;  201.648]. 

For  “Werder  Bremen”,  we  need  to  use  the  quantile  £22, 0.975  =  2.07  39  which  yields 
a  confidence  interval  of 

j  „  5.239  ^  _ 

hi.Br)  =  x  —  tn- i;i-a/2  •  — p  =  187.52  —  *22;  0.975  ■  — 7=-  =  185.255, 


V23 

j  „  5.239 

lu(Br)  =  X  +  G-l;l-a/2  •  =  187.25  +  *2 2;0.975  •  > —  =  189.786. 

'v  ^7  v  23 

The  interval  is  therefore  [185.255;  189.786]. 


•  The  mean  heights  of  the  basketball  teams  are  obviously  larger  than  the  mean 
height  of  the  football  team.  The  two  confidence  intervals  of  the  basketball  teams 
overlap,  whereas  the  intervals  of  the  football  team  with  the  two  basketball  teams 
do  not  overlap.  It  is  evident  that  this  indicates  that  the  height  of  football  players  is 
substantially  less  than  the  height  of  basketball  players.  In  Chap.  10,  we  will  learn 
that  confidence  intervals  can  be  used  to  test  hypotheses  about  mean  differences. 


Solution  to  Exercise  9. 7 


(a)  Usings 
that  Xi  = 


=  98,  we  calculate  an  unbiased  point  estimate  for  p  using  v  =  p.  Note 
1  if  the  wife  has  to  wash  the  dishes. 


„  1  ^  1 

P=n^Xi  =  ~ 
i  =  1 


59  = 


59 


98 


98 


0.602. 
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(b)  Since  np(l  —  p)  —  98  •  0.602  •  0.398  =  23.48  >  9  and  p  is  sufficiently  large, 
we  can  use  the  normal  approximation  to  calculate  a  confidence  interval  for  p. 
Using  zi-a/2  =  zo.975  =  1-96,  we  obtain 


Ii(X )  =  0.602  -  1.96 
4(2 0  =  0.602  +  1.96 


0.602  •  0.398 
98 

0.602  •  0.398 
98 


0.505, 

0.699. 


This  yields  a  confidence  interval  of  [0.505,  0.699].  Note  that  the  confidence 
interval  does  not  contain  p  =  0.5  which  is  the  probability  that  would  be  expected 
if  the  coin  was  fair.  This  is  a  clear  sign  that  the  coin  might  be  unfair. 


(c)  If  the  coin  is  fair,  we  can  use  p  =  0.5  as  our  prior  judgement.  We  would  then 
need 


n  > 


> 


'Zl-g/2 

A 

'  1.96 
0.005 


-i2 


P(  1 


-i2 


OS 


p) 

38,416 


dinners  to  get  the  desired  precision — which  is  not  possible  as  this  would  con¬ 
stitute  a  time  period  of  more  than  100  years.  This  shows  that  the  expectation  of 
such  a  high  precision  is  unrealistic  and  needs  to  be  modified.  However,  as  the 
results  of  (b)  show,  the  coin  may  not  be  fair.  If  the  coin  is  indeed  unfair,  we  may 
have,  for  example,  p  =  0.6  and  1  —  p  =  0.4  which  gives  a  smaller  sample  size. 
We  can  thus  interpret  the  sample  size  as  a  conservative  estimate  of  the  highest 
number  of  dinners  needed. 


Solution  to  Exercise  9.8  If  a  student  fails  then  jt*  =  1 .  We  know  that  ^  v*  =  11  and 
n  =  104. 

(a)  Using  x  =  p  as  point  estimate,  we  get  p  =  ^  ~  0.106  =  10.6  %.  Using 
zi-a/2  =  £0.975  =  1-96,  we  can  calculate  the  confidence  interval  as 

/ 0.106-0.894 

0.106+  1.96- ,/ -  =  [0.047;  0.165]. 

V  104 

Using  R  we  get: 

binom. test (11 , 104) $conf . int 
[1]  0.05399514  0.18137316 

This  result  is  different  because  the  above  command  does  not  use  the  normal 
approximation.  In  addition,  p  is  rather  small  which  means  that  care  must  be 
exercised  when  using  the  results  of  the  confidence  interval  with  normal  approx¬ 
imation  in  this  context. 
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(b)  The  point  estimate  of  10.6  %  is  substantially  higher  than  3.2  %.  The  lower  bound 
confidence  interval  is  still  larger  than  the  proportion  of  failures  at  county  level. 
This  indicates  that  the  school  is  doing  worse  than  most  other  schools  in  the 
county. 

Solution  to  Exercise  9.9 

(a)  Whether  the  i th  household  has  switched  on  the  TV  and  watches  “Germany’s 
next  top  model”  (GNTM)  relates  to  a  random  variable  X /  with 

Xi  =  1  :  if  TV  switched  on  and  household  watching  GNTM 
X i  =  0  :  if  TV  switched  off  or  household  watches  another  show. 

It  follows  that  X  =  X?=°i°  Xi  is  the  random  variable  describing  the  number  of 
TVs,  out  of  2500  TVs,  which  are  switched  on  and  show  GNTM.  Since  the  Xi ’s 
can  be  assumed  to  be  i.i.d.,  we  can  say  that  X  follows  a  binomial  distribution, 
i.e.  X  ~  5(2500;  p)  with  p  unknown.  The  length  of  the  confidence  interval  for 

/ 5(1  —  P)  ,  , 

P  ~  Zl-a/2y  - - - ^  P  ±  Zl-a/2 

is 

L  =  2zi- 

Unfortunately,  p(l  —  p)  is  unknown  but  p(l  —  p)  <  \  because  the  maximum 
value  for  p{\  —  p)  is  0.25  if  p  =  0.5.  Hence 

/T  1.96 

L  <  2zi-a/2\  -  =  ~i=  =  0.0392. 

V  n  V2500 

This  means  the  precision  is  half  the  length,  i.e.  ±0.0196  =  ±1.96  %,  in  the  worst 
case. 

(b)  There  is  the  danger  of  selection  bias.  A  total  of  500  households  refused  to 
take  part  in  the  study.  It  may  be  that  their  preferences  regarding  TV  shows  are 
different  from  the  other  2500  households.  For  example,  it  may  well  be  possible 
that  those  watching  almost  no  TV  refuse  to  be  included;  or  that  those  watching 
TV  shows  which  are  considered  embarrassing  by  society  are  refusing  as  well. 
In  general,  missing  data  may  cause  point  estimates  to  be  biased,  depending  on 
the  underlying  mechanism  causing  the  absence. 
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Solution  to  Exercise  9.10 


Using  zi— a/2  =  1-96  and  a  =  0.233,  we  obtain  an  optimal  sample  size  of  at 
least 


ft  opt  — 


"oU-a/2CT0“ 

2 

r  1.96-0.2331 
0 

L  A  J 

0.2 

20.85. 


To  calculate  a  confidence  interval  with  a  width  of  not  more  than  0.2  s,  the  results 
of  at  least  21  athletes  are  needed. 


(b)  The  sample  size  is  30.  Thus,  the  confidence  interval  width  should  be  smaller 
than  0.2  s.  This  is  indeed  true  as  the  calculations  show: 

0  233 

[10.93  ±  *0. 975;29  =  [10.84;  11.02], 


730 


2.045 


The  width  is  only  0.18  s. 


(c)  If  we  calculate  a  80  %  confidence  interval,  then  the  lower  confidence  limit 
corresponds  to  the  running  time  which  is  achieved  by  only  10  %  of  the  athletes 
(of  the  population).  Using  /o.9;29  ~  1.31  we  get 

0  233 

[10.93  ±  1.31  •  =  [10.87;  10.99]. 

730 

The  athlete’s  best  time  is  thus  below  the  lower  confidence  limit.  He  is  among 
the  top  10  %  of  all  athletes,  using  the  results  of  the  confidence  interval. 


Solution  to  Exercise  9.11 


(a)  The  odds  ratio  is 


OR  = 


163  •  All 


1.08. 


475  •  151 

This  means  the  chances  that  a  pizza  arrives  in  time  are  about  1 .08  times  higher 
for  Laura  compared  with  Melissa.  To  calculate  the  95  %  confidence  interval,  we 
need  %  =  ln(1.08)  ~  0.077,  z\-a/2  ~  1-96,  and 


a0o  ~ 


/  i 

1 

1 

i  v 

(  „ 

T  T 

T  ) 

V  163 

475 

151 

All  1 

=  0.13. 


The  interval  for  the  log  odds  ratio  is 

[ln(l .08)  ±  1.96  •  0.13]  «  [-0.18;  0.33] . 


Exponentiating  the  interval  gives  us  the  95  %  confidence  interval  for  the  odds 
ratio  which  is  [0.84;  1 .39].  This  indicates  that  the  odds  of  Laura’s  pizzas  arriving 
earlier  than  Melissa’s  are  not  much  different  from  one.  While  the  point  estimate 
tells  us  that  Laura’s  pizzas  are  delivered  1.08  times  faster,  the  confidence  interval 
tells  us  that  there  is  uncertainty  around  this  estimate  in  the  sense  that  it  could 
also  be  smaller  than  1  and  Melissa  may  not  necessarily  work  more  slowly  than 
Laura. 
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(b)  We  can  reproduce  the  results  in  R  by  attaching  the  pizza  data,  creating  a  cat¬ 
egorical  delivery  time  variable  (using  cut)  and  then  applying  the  oddsratio 
command  from  the  library  epitools  onto  the  contingency  table: 

attach(pizza) 

timecat  <-  cut (time,  breaks=c (-1 , 30 , 150) ) 
library (epitools) 

oddsratio (table (timecat , operator) ,  method= ' wald ' ) 


Chapter  10 


Solution  to  Exercise  10.1  A  type  I  error  is  defined  as  the  probability  of  rejecting  Ho 
if  Ho  is  true.  This  error  occurs  if  A  thinks  that  B  does  confess,  but  B  does  not.  In  this 
scenario,  A  confesses,  goes  free,  and  B  serves  a  three-year  sentence.  A  type  II  error 
is  defined  as  the  probability  of  accepting  Ho,  despite  the  fact  that  Ho  is  wrong.  In 
this  case,  B  does  confess,  but  A  does  not.  In  this  scenario,  B  goes  free  and  A  serves 
a  three-year  sentence.  A  type  II  error  is  therefore  worse  for  A.  With  a  statistical  test, 
we  always  control  the  type  I  error,  but  not  the  type  II  error. 


Solution  to  Exercise  10.2 


(a)  The  hypotheses  are 

Ho  :  ti  =  100  versus  H\  :  /x  7^  100. 


(b)  It  is  a  one-sample  problem  for  /1:  thus,  for  known  variance,  the  Gauss  test  can 
be  used;  the  t-test  otherwise,  see  also  Appendix  D.  Since,  in  this  exercise,  a 
is  assumed  to  be  known  we  can  use  the  Gauss  test;  i.e.  we  can  compare  the 
test  statistic  with  the  normal  distribution  (as  opposed  to  the  t -distribution  when 
using  the  £-test).  The  sample  size  is  small:  we  must  therefore  assume  a  normal 
distribution  for  the  data. 


(c)  To  calculate  the  realized  test  statistic,  we  need  the  arithmetic  mean  v  =  98.08. 
Then  we  obtain 


x  —  fl 

(7 


98.08  -  100 
2 


—  •  715  =  -3.72. 
2 


We  reject  Ho  if  |£(x)|  >  =  1.96.  Since  \t(x)\  =  3.72  >  1.96,  the  null 

hypothesis  can  be  rejected.  The  test  is  significant.  There  is  enough  evidence 
to  suggest  that  the  observed  weights  do  not  come  from  a  normal  distribution 
with  mean  100  g. 
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(d)  To  show  that /i  <  100,  we  need  to  conduct  a  one-sided  test.  Most  importantly,  the 
research  hypothesis  of  interest  needs  to  be  stated  as  the  alternative  hypothesis: 

Ho  :  11  >100  versus  Hi  :  //  <  100. 

(e)  The  test  statistic  is  the  same  as  in  (b):  t  (v)  =  —3.72.  However,  the  critical  region 
changes.  Ho  gets  rejected  if  t(x)  <  —z\-a  —  —1.64.  Again,  the  null  hypothesis 
is  rejected.  The  producer  was  right  in  hypothesizing  that  the  average  weight  of 
his  chocolate  bars  was  lower  than  100  g. 


Solution  to  Exercise  10.3 


(a)  We  calculate 


No  auction 

Auction 

X 

16.949 

10.995 

s 2 

2.948 

2.461 

s 

1.717 

1.569 

V 

0.101 

0.143 

Note  that  we  use  the  unbiased  estimates  for  the  variance  and  the  standard  devia¬ 
tion  as  explained  in  Chap.  9;  i.e.  we  use  1  /(n  —  1)  rather  than  l/n.  It  is  evident 
that  the  mean  price  of  the  auctions  ( fia )  is  lower  than  the  mean  non-auction 
price  (j iina),  and  also  lower  than  the  price  from  the  online  book  store.  There  is, 
however,  a  higher  variability  in  relation  to  the  mean  for  the  auction  prices.  One 
may  speculate  that  the  best  offers  are  found  in  the  auctions,  but  that  there  are  no 
major  differences  between  the  online  store  and  the  internet  book  store,  i.e. 

•  Hna  7^  €16.95, 

•  Ea  <  €16.95, 

•  f^na  >  l^a- 


(b)  We  can  use  the  t-test  to  test  the  hypotheses 

Ho  :  A Lna  =  16.95  versus  H\  :  jina  /  16.95. 
The  test  statistic  is 


t(x)  = 


x  —  fi, 


16.949  -  16.95 


n  = 


Vl4  =  -0.002. 


a  1.717 

Using  the  decision  rules  from  Table  10.2,  we  conclude  that  Ho  gets  rejected  if 
\t(x)\  >  ^13, 0.975  =  2.16.  We  can  calculate  ^13, 0.975  either  by  using  Table  C.2 
or  by  using  R  (qt  (0 . 975 , 13)).  Since  \t(x)\  =  0.002  ^  2.16,  we  keep  the  null 
hypothesis.  There  is  not  enough  evidence  to  suggest  that  the  prices  of  the  online 
store  differ  from  €16.95  (which  is  the  price  from  the  internet  book  store). 
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(c)  Using  (9.6)  we  can  calculate  the  upper  and  lower  limits  of  the  confidence  interval 
as, 


X  T  tn  —  1;  1—  a/2 


•X  tn  —  l\l—a/2 


16.949  +  ^13;0.975 
=2.16 


1.717 

a/14 


16.949  —  ^13;0.975 
=2.16 


1.717 

>/l4 


17.94 

15.96 


respectively.  The  confidence  interval  does  contain  /xo  =  16.95;  hence,  the  null 
hypothesis  /x  =  /xo  =  16.95  cannot  be  rejected.  This  is  the  same  conclusion  as 
obtained  from  the  one- sample  t- test  from  above. 


(d)  We  test  the  following  hypotheses: 

Ho  :  i±a  >  16.95 
The  realized  test  statistic  is 


versus  H\  :  pa  <  16.95. 


t(x)  = 


x  —  /I 


10.995  -  16.95 


n  = 


Vl4  =  -14.201 


a  1.569 

Table  10.2  tells  us  that  Ho  gets  rejected  if  t(x)  <  —  *13,0.95.  Since  —14.201  < 
—2.16  we  reject  Ho.  The  test  confirms  our  hypothesis  that  the  mean  auction 
prices  are  lower  than  €16.95,  i.e.  the  price  from  the  book  store. 


(e)  We  need  to  conduct  two  tests:  the  two-sample  t- test  for  the  assumption  of  equal 
variances  and  the  Welch  test  for  the  assumption  of  different  variances. 


(i)  Two-sample  t-test : 


Ho  :  Una  <  da  versus  H 1  :  iina  >  pa . 


To  calculate  the  test  statistic  (10.4),  we  need  to  determine  the  pooled  sample 
variance: 


.v  = 


(nna  -  1  )sla  +  (na  -  1)5 


a 


Mna  T  2 

(14-  1)2.948 +  (14-  1)2.461 


14+14-2 


2.705 


The  test  statistic  is 


%na  %a 
S 


n 


na  ’  Wa 


Kna  T  ^a 


16.949  -  10.995 
V2.705 


5.954 

1.645 


3.621  •  V7  =  9.578. 


14-  14 
14+14 


We  reject  H0  if  t(x)  >  tnna+na-2, o.95  =  U6+.95  =  1.71.  Table  C.2  does  not 
list  the  quantile;  thus,  one  uses  R  (qt  (0 . 95 , 26) )  to  determine  the  quantile. 
Since  9.578  >  1.71,  we  reject  the  null  hypothesis.  The  test  is  significant. 
There  is  enough  evidence  to  conclude  that  the  mean  auction  prices  are  lower 
than  the  mean  non-auction  prices. 
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(ii)  Welch  test :  For  the  Welch  test,  we  calculate  the  test  statistic,  using  (10.6) 
as: 

\xna  —  Xa\  1 16.949  —  10.995 ' 


t(x)  = 


'  v2  o2 

3 na  | 

Hna  Ha 


2.948  _j_  2.461 


=  9.578. 


V  14  1  14 

To  calculate  the  critical  value,  we  need  the  degrees  of  freedom: 


v  = 


( 

(- 


(sna/nna) 


n 


na 


-  1 


14 


25.79 


948  2.461 


948/14)2  (2.461/14) 


13 


+ 


13 


We  reject  Ho  if  t(x )  >  tv, 0.95.  Using  tVi 0.95  =  1.706  (in  R  obtained  as 
qt (0.95,25.79))  and  t(x )  =  9.578,  we  reject  the  null  hypothesis.  The 
conclusions  are  identical  to  using  the  two- sample  t- test  (in  this  example). 
Interestingly,  the  two  test  statistics  are  similar  which  indicates  that  the 
assumption  of  equal  variances  in  case  (i)  was  not  unreasonable. 


(f)  The  F-test  relies  on  the  assumption  of  the  normal  distribution  and  tests  the 
hypotheses: 

Ho  ■  <?la  =  o2a  versus  Hi  :  a2na  /  a2a. 


We  can  calculate  the  test  statistic  as  described  in  Appendix  C: 


2.949 

2.461 


1.198. 


The  F -distribution  is  not  symmetric  around  0;  thus,  we  need  to  calculate  two  crit- 
ical  values:  fni-Un2-Ui-a/2  and  fni-Un2-Ua/2-  Using  R  we  get  /13, 13,0.975  = 
3.115(qf  (0.975, 13, 13))  and /13, 13, 0.025  =  3TT5  =  0.321  (qf  (0.025, 13, 13)). 
H0  is  rejected  if  t(x)  >  fni_Un2_Ui_a/2ort(x)  <  /ni-i;„2-i;a/2.Since0.321  < 
1.198  <  3.115  we  do  not  reject  Ho.  This  means  there  is  strong  evidence  that  the 
two  variances  are  equal.  This  is  also  reassuring  with  respect  to  using  the  two- 
sample  t- test  above  (which  assumes  equal  variances).  However,  note  that  testing 
the  equality  of  variances  with  the  F-test  and  then  using  the  two- sample  t- test  or 
Welch  test  based  on  the  outcome  of  the  F-test  is  not  ideal  and  not  necessarily 
correct.  In  practice,  it  is  best  to  simply  use  the  Welch  test  (rather  than  the  t- test), 
which  is  also  the  default  option  in  the  R  function  t .  test. 
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(g)  We  need  the  following  table  to  calculate  the  rank  sums: 


Value 

9.3 

9.52 

9.54 

10.01 

10.5 

10.5 

10.55 

10.59 

11.02 

11.03 

Sample 

a 

a 

a 

a 

a 

a 

a 

a 

a 

a 

Rank 

1 

2 

3 

4 

5 

6 

7 

8 

9 

10 

Value 

11.89 

11.99 

12 

13.79 

15.49 

15.9 

15.9 

15.9 

15.9 

15.9 

Sample 

a 

a 

a 

na 

a 

na 

na 

na 

na 

na 

Rank 

11 

12 

13 

14 

15 

16 

17 

18 

19 

20 

Value 

15.99 

16.98 

16.98 

17.72 

18.19 

18.19 

19.97 

19.97 

Sample 

na 

na 

na 

na 

na 

na 

na 

na 

Rank 

21 

22 

23 

24 

25 

26 

27 

28 

We  can  calculate  the  rank  sums  as  Rna+  =  13  +  15  +  --  -  +  28  =  300  and 
Ra+  =  l+2+---  +  13  +  15  =  106,  respectively.  Thus 

o  14-15  o  14-15 

U  \  =  142  + -  106  =  195;  U2  =  142  + -  300  =  1. 

2'  2' 

With  U  =  min(195,  1)  =  1  we  can  calculate  the  test  statistic,  which  is  approx¬ 
imately  normally  distributed,  as: 

JJ  _  n  i  -Ti2  1  _  14^ 

t(x,  y )  =  2  =  2 - ~  —4.46 

n\  •  ri2  •  (n\  +  ri2  +  1)  / 14  *  14  *  29 

V  12  V  12 

Since  \  t(x,y)\  =  4.46  >  z\-aj2  =  1.96,  the  null  hypothesis  can  be  rejected.  We 
conclude  that  the  locations  of  the  two  distributions  are  shifted.  Obviously  the 
prices  of  the  auction  are  smaller  in  general,  and  so  is  the  median. 

(h)  We  can  type  in  the  data  and  evaluate  the  summary  statistics  using  the  mean, 
var,  and  sd  commands: 

na  <-  c(18.19,16.98,19.97,  ...  ,17.72)  |f^ 

a  <-  c(10.5,12.0,  9.54,...,  11.02) 

mean(na) 

mean (a) 

var (na) 


The  t .  test  command  can  be  used  to  answer  questions  (b)-(e).  For  (b)  and  (c) 
we  use 


t . test (na,mu=16 . 95) 
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One-sample  t-test 


data:  na 

t  =  -0.0031129,  df  =  13,  p-value  =  0.9976 
alternative  hypothesis:  true  mean  is  not  equal  to  16.95 
95  percent  confidence  interval: 

15.95714  17.94001 
sample  estimates: 
mean  of  x 
16.94857 

The  test  decision  can  be  made  by  means  of  either  the  p-value  (=  0.9976  >  0.05) 
or  the  confidence  interval  ([15.95;  17.94],  which  covers  16.95).  To  answer  (d) 
and  (e)  we  need  to  make  use  of  the  option  alternative  which  specifies  the 
alternative  hypothesis: 

t . test (a,mu=16 . 95 , alternative= 1  less ' ) 
t . test (na, a,  alternative= ' greater ' ) 

Note  that  the  two-sample  test  provides  a  confidence  interval  for  the  difference 
of  the  means.  Questions  (f)  and  (g)  can  be  easily  solved  by  using  the  var  .test 
and  wilcox. test  commands: 

var . test (na, a) 
wilcox .test (na, a) 


F-test  to  compare  two  variances 
data:  na  and  a 

F  =  1.198,  num  df  =  13,  denom  df  =  13,  p-value  =  0.7496 
alternative  hypothesis:  true  ratio  of  variances  not  equal  to  1 
95  percent  confidence  interval: 

0.3845785  3.7317371 
sample  estimates: 
ratio  of  variances 
1.197976 


Wilcoxon  rank  sum  test  with  continuity  correction 

data:  na  and  a 
W  =  195,  p-value  =  8.644e-06 

alternative  hypothesis:  true  location  shift  is  not  equal  to  0 
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The  test  decision  is  best  made  by  using  the  p-w alue. 


Solution  to  Exercise  10.4  Since  the  data  before  and  after  the  diet  is  dependent 
(weight  measured  on  the  same  subjects),  we  need  to  use  the  paired  t- test.  To  calculate 
the  test  statistic,  we  need  to  first  determine  the  weight  differences: 


Person  i 

1 

2 

3 

4 

5 

6 

7 

8 

9 

10 

Before  diet 

80 

95 

70 

82 

71 

70 

120 

105 

111 

90 

After  diet 

78 

94 

69 

83 

65 

69 

118 

103 

112 

88 

Differences  d 

2 

1 

1 

-1 

6 

1 

2 

2 

-1 

2 

Using  d  =  1.5  and 


1 

10-  1 


(0.52  +  0.52  +  0.52  +  2.52  +  4.52  +  •  •  •  +  0.52) 


3.83, 


we  calculate  the  test  statistic  as 


d  r~  1.5  / — 

t(d)  =  —Cn  =  ^=V10  =  2.42. 

$d  V3.83 

The  null  hypothesis  is  rejected  because  \t{d)\  =  2.42  >  tg- 0.975  =  2.26.  This  means 
the  mean  weight  before  and  after  the  diet  is  different.  We  would  have  obtained  the 
same  results  by  calculating  the  confidence  interval  for  the  differences: 


d  ±  tn- l;l-a/2 


1.5  ±2.26- 


VT83 

yio 


[0.1;  2.9]. 


Since  the  confidence  interval  does  not  overlap  with  zero,  we  reject  the  null  hypothesis; 
there  is  enough  evidence  that  the  mean  difference  is  different  (i.e.  greater)  from  zero. 
While  the  test  is  significant  and  suggests  a  weight  difference,  it  is  worth  noting  that 
the  mean  benefit  from  the  diet  is  only  1.5  kg.  Whether  this  is  a  relevant  reduction  in 
weight  has  to  be  decided  by  the  ten  people  experimenting  with  the  diet. 


Solution  to  Exercise  10.5 

(a)  The  production  is  no  longer  profitable  if  the  probability  of  finding  a  deficient 
shirt  is  greater  than  10  %.  This  equates  to  the  hypotheses: 

H0  :  p  <  0.1  versus  Hq  :  p  >  0.1. 
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The  sample  proportion  of  deficient  shirts  is  p  =  ^  Since  np(l  —  p)  = 

230  •  ^  ^  >  9  we  can  use  the  test  statistic 

t(x)  =  +  ~  P0  V”  =  M  ~  •  V230 

Vpo(i  -  po)  fl TY 

V  io  10 

=  —  •  —  •  V230  =  —  •  V230  =  2.638. 

115  3  23 

The  null  hypothesis  gets  rejected  because  t(x)  =  2.638  >  zo.95  =  1.64.  It  seems 
the  production  is  no  longer  profitable. 

(b)  The  test  statistic  is  t  (x)  =  ^  jc*  =  30.  The  critical  region  can  be  determined  by 
calculating 

PHq(Y  <  ci)  <  0.025  and  PHq(Y  >  cr)  <  0.975. 

Using  R  we  get 

qbinom(p=0 . 975 ,prob=0 . 1 , size=230) 

[1]  32 

qbinom(p=0 . 025 ,prob=0 . 1 , size=230) 

[1]  14 


The  test  statistic  ( t(x )  =  30)  does  not  fall  outside  the  critical  region  ([14;  32]); 
therefore,  the  null  hypothesis  is  not  rejected.  The  same  result  is  obtained  by 
using  the  binomial  test  in  R :  binom.test (c (30, 200)  ,p=0.1).  This  yields 
a  p-value  of  0.11239  and  a  confidence  interval  covering  10  %  ([0.09;  0.18]). 
Interestingly,  the  approximate  binomial  test  rejects  the  null  hypothesis,  whereas 
the  exact  test  keeps  it.  Since  the  latter  test  is  more  precise,  it  is  recommended  to 
follow  its  outcome. 

(c)  The  research  hypothesis  is  that  the  new  machine  produces  fewer  deficient  shirts: 

77o  :  Pntw  —  Pold  versus  H\  :  Pnew  <  Po\&- 
To  calculate  the  test  statistic,  we  need  the  following: 


This  yields: 


-^new  -^old 

^new  ^  old 

-%ew  +  dC0ld 

72  new  T  77  old 


7  7 

U5  _  46 
7  +  35 

230  +  115 


new? 


+>ld) 


d 


21 
 230 

42  _  14 
345  ““  U5' 


21 

230 

101/  1 
115  1 115 


+ 


1 

230 


) 


21 

230 


V0.1069- 0.013 


0.0913 


-2.448. 


0.0373 
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The  null  hypothesis  is  rejected  because  t(x new>*oid)  =  —2.448  <  zo.05  = 
— zo .95  =  —1.64.  This  means  we  accept  the  alternative  hypothesis  that  the  new 
machine  is  better  than  the  old  one. 

(d)  The  data  can  be  summarized  in  the  following  table: 


Machine  1  Machine  2 

Deficient 

Fine 

30  7 

200  112 

To  test  the  hypotheses  established  in  (c),  we  apply  the  fisher  .test  command 
onto  the  contingency  table: 

fisher .test (matrix (c (30 , 200, 7, 112) ,ncol=2)) 


This  yields  a  p-value  of  0.0438  suggesting,  as  the  test  in  (c),  that  the  null  hypoth¬ 
esis  should  be  rejected.  Note  that  the  confidence  interval  for  the  odds  ratio,  also 
reported  by  R ,  does  not  necessarily  yield  the  same  conclusion  as  the  test  of 
Fisher. 


Solution  to  Exercise  10.6 


(a)  To  compare  the  two  means,  we  should  use  the  Welch  test  (since  the  variances 
cannot  be  assumed  to  be  equal).  The  test  statistic  is 


t(x,  y )  = 


\x  -  y\ 


103  -  101.8| 


0.0337. 


12599.56 

10 


+ 


62.84 

10 


The  alternative  hypothesis  is  H\  :  fi2  >  ;  be.  we  deal  with  a  one-sided  hypoth¬ 

esis.  The  null  hypothesis  is  rejected  if  t(x,  y)  >  tv- i_a.  Calculating 

|  jjV  ,  (4/ni A 

n\  n2  J  \  n\  -  1  U2  —  1  J 

62.844  12599\2  / (62.844/10)2  (12599/10)2 

10  +  10  )  \  9  +  9 

yields  t9.09;0.95  =  1.831  (using  qt(0.95,9.09)  in  R;  or  looking  at  Table  C.2 
for  9  degrees  of  freedom).  Therefore,  t(x,  y)  <  t9.09;0.95  and  the  nuU  hypothesis 
is  not  rejected.  This  means  player  2  could  not  prove  that  he  scores  higher  on 
average. 
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(b)  We  have  to  first  rank  the  merged  data: 


Value 

6 

29 

40 

47 

47 

64 

87 

88 

91 

98 

Sample 

2 

2 

2 

2 

2 

2 

2 

1 

1 

2 

Rank 

1 

2 

3 

4 

5 

6 

7 

8 

9 

10 

Value 

99 

99 

101 

104 

105 

108 

111 

112 

261 

351 

Sample 

1 

1 

1 

1 

1 

1 

1 

1 

2 

2 

Rank 

11 

12 

13 

14 

15 

16 

17 

18 

19 

20 

This  gives  us  7?i+  =  8  H-  9  +  *  *  *  H-  18  =  133,  7?2+  =  1+  2  +  --  -  + 20  =  77, 
U\  =  102  +  (10-  ll)/2  —  133  =  22,  U2  =  102  +  (10  •  1 1)/2  —  77  =  78,  and 
therefore  U  =  22.  The  test  statistic  can  thus  be  calculated  as 


t(x,  y)  = 


U  - 


721-722 


22- 


10^ 


n  i  •  Yi2  '  in  i  +^2  +  1) 


10-  10-21 


=  -2.12. 


12 


12 


Since  \t(x,  y)\  =  2.12  >  z l-a  =  1-64,  the  null  hypothesis  is  rejected.  The  test 
supports  the  alternative  hypothesis  of  higher  points  for  player  1 .  The  U -test  has 
the  advantage  of  not  being  focused  on  the  expectation  /i.  The  two  samples  are 
clearly  different:  the  second  player  scores  with  much  more  variability  and  his 
distribution  of  scores  is  clearly  not  symmetric  and  normally  distributed.  Since 
the  sample  is  small,  and  the  assumption  of  a  normal  distribution  is  likely  not 
met,  it  makes  sense  to  not  use  the  t- test.  Moreover,  because  the  distribution  is 
skewed  the  mean  may  not  be  a  sensible  measure  of  comparison.  The  two  tests 
yield  different  conclusions  in  this  example  which  shows  that  one  needs  to  be 
careful  in  framing  the  right  hypotheses.  A  drawback  of  the  U -test  is  that  it  uses 
only  ranks  and  not  the  raw  data:  it  thus  uses  less  information  than  the  t- test 
which  would  be  preferred  when  comparing  means  of  a  reasonably  sized  sample. 


Solution  to  Exercise  10.7  Otto  speculates  that  the  probability  of  finding  a  bear 
of  colour  i  is  0.2,  i.e.  Pwhite  —  0.2,  pred  —  0.2,  /^orange  —  0.2,  /^yellow  —  0.2,  and 
P green  =  0.2.  This  hypothesis  can  be  tested  by  using  the  x2  goodness-of-fit  test. 
The  test  statistic  is 

9  ( Ni  —  npi)2  1  9  9 

t(x)  =  X2  =  y  — - —  =  ^(222  -  250) 2  +  (279  -  250)2 

“  npi  250 

+  (251  -  250)2  +  (232  -  250)2  +  (266  -  250)2  =  8.824. 


The  null  hypothesis  cannot  be  rejected  because  t(x)  =  8.824  ^  C4_i_o;o.95  =  9.49. 
While  the  small  number  of  white  bears  might  be  disappointing,  the  test  suggests  that 
there  is  not  enough  evidence  to  reject  the  hypothesis  of  equal  probabilities. 
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Solution  to  Exercise  10.8 

(a)  To  answer  this  question,  we  need  to  conduct  the  x2 -independence  test.  The  test 
statistic  t(x,  y)  is  identical  to  Pearson’s  x2  statistic,  introduced  in  Chap.  4.  In 
Exercise  4.4,  we  have  calculated  this  statistic  already,  x2  ~  182,  see  p.  350 
for  the  details.  The  null  hypothesis  of  independence  is  rejected  if  t(x,  y)  = 
X2  >  C(/_ i)(j_ i); i— q/.  With  1  =  2  (number  of  rows),  and  J  =  4  (number  of 
columns)  we  get  C3;o.95  =  7.81  using  Table  C.3  (or  qchisq(0 . 95,3)  in  R). 
Since  182  >  7.81,  we  reject  the  null  hypothesis  of  independence. 

(b)  The  output  refers  to  a  x2  test  °f  homogeneity:  the  null  hypothesis  is  that  the 
proportion  of  passengers  rescued  is  identical  for  the  different  travel  classes.  This 
hypothesis  is  rejected  because  p  is  smaller  than  a  =  0.05.  It  is  evident  that  the 
proportions  of  rescued  passengers  in  the  first  two  classes  (60  %,  43.9  %)  are  much 
higher  than  in  the  other  classes  (25  %,  23.8  %).  One  can  see  that  the  test  statistic 
(182.06)  is  identical  to  (a).  This  is  not  surprising:  the  x2' -independence  test  and 
the  x2  test  °f  homogeneity  are  technically  identical,  but  the  null  hypotheses 
differ.  In  (a),  we  showed  that  “rescue  status”  and  “travel  class”  are  independent; 
in  (b),  we  have  seen  that  the  conditional  distributions  of  rescue  status  given  travel 
class  differ  by  travel  class,  i.e.  that  the  proportions  of  those  rescued  differ  by  the 
categories  1.  class/2,  class/3,  class/staff. 

(c)  The  summarized  table  is  as  follows: 


1.  Class/2.  Class  3.  Class/Staff 

Total 

Rescued 

Not  rescued 

327  391 

295  1215 

718 

1510 

Total 

622  1606 

2228 

Using  (4.7)  we  get 


y )  =  x2 


2228(327  •  1215  -  295  •  391)2 
718-  1510-622-  1606 


163.55. 


The  x2' -independence  test  and  the  x2  test  of  homogeneity  are  technically  identi¬ 
cal:  Hq  is  rejected  if  t(x,  y )  >  C(/_i)(/_i);i_a.  Since  163.55  >  O;0.95  =  3.84 
the  null  hypothesis  is  rejected.  As  highlighted  in  (b)  this  has  two  interpretations: 
(i)  “rescue  status”  and  “travel  class”  are  independent  (independence  test)  and  (ii) 
the  conditional  distributions  of  rescue  status  given  travel  class  differ  by  travel 
class  (homogeneity  test).  The  second  interpretation  implies  that  the  probability 
of  being  rescued  differs  by  travel  class.  The  null  hypothesis  of  the  same  prob¬ 
abilities  of  being  rescued  is  also  rejected  using  the  test  of  Fisher.  Summarizing 
the  data  in  a  matrix  and  applying  the  f  isher  .test  command  yields  a  p-value 
smaller  than  a  =  0.05. 


fisher . test (matrix(c (327 , 295 , 391 , 1215) ,ncol=2 ,nrow=2) ) 
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Solution  to  Exercise  10.9 


(a)  The  hypotheses  are: 

Ho  :  Ex  =  Ey  l  versus 
The  pooled  variance, 

9  19  •  2.94  +  19  •  2.46 

s  =  - 

39 

is  needed  to  calculate  the  test  statistic: 


Hi  :  Ex  /  Ey  l 


102.6 

39 


=  2.631, 


t(x,  y)  = 


4.97  -  4.55  400  0.42 


Vl0  =  0.8188. 


1.622  V  40  1.622 

Ho  is  rejected  if  | t(x,  y)  |  >  ^38, 0.975  ~  2.02.  Since  \t(x,  y)|  =  0.8188  we  do  not 
reject  the  null  hypothesis. 

(b)  The  hypotheses  are: 

Ho  :  Ex  =  EY2  versus  H\  :  fix  ^  EY2 • 

The  pooled  variance, 

9  19-2.94+19-3.44 

s2  = - - - =  3.108, 


39 

is  needed  to  calculate  the  test  statistic: 

4.97  -  3.27 
t(x,  y )  =  - 


1.763 


•  VTO  =  3.049. 


Ho  is  rejected  because  \t(x,y)\  =  3.049  >  ^38, 0.975  ~  2.02. 

(c)  In  both  (a)  and  (b),  there  exists  a  true  difference  in  the  mean.  However,  only  in  (b) 
is  the  Mest  able  to  detect  the  difference.  This  highlights  that  smaller  differences 
can  only  be  detected  if  the  sample  size  is  sufficiently  large.  However,  if  the 
sample  size  is  very  large,  it  may  well  be  that  the  test  detects  a  difference  where 
there  is  no  difference. 


Solution  to  Exercise  10.10 


(a)  After  reading  in  and  attaching  the  data,  we  can  simply  use  the  t .  test  command 
to  compare  the  expenditure  of  the  two  samples: 

theatre  <-  read. csv( ' theatre . csv ' ) 
attach (theatre) 

t . test (Culture [Sex==l] , Culture  [Sex==0] ) 
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Welch  Two-Sample  t-test 

data:  Culture  [Sex  ==  1]  and  Culture [Sex  ==  0] 

t  =  -1.3018,  df  =  667.43,  p-value  =  0.1934 
alternative  hypothesis:  true  difference  not  equal  to  0 
95  percent  confidence  interval: 

-12.841554  2.602222 

sample  estimates: 
mean  of  x  mean  of  y 
217.5923  222.7120 

We  see  that  the  null  hypothesis  is  not  rejected  because  p  =  0.1934  >  a  (also, 
the  confidence  interval  overlaps  with  “0”). 

(b)  A  two-sample  t-test  and  a  t/-test  yield  the  same  conclusion.  The  y>-values, 
obtained  with 


wilcox . test (Culture [Sex==l] , Culture  [Sex==0] ) 
t . test (Culture [Sex==l] , Culture [Sex==0] , var . equal=TRUE) 


are  0.1946  and  0.145,  respectively.  Interestingly,  the  test  statistic  of  the  two- 
sample  t-test  is  almost  identical  to  the  one  from  the  Welch  test  in  (a)  (—  1 .2983) — 
this  indicates  that  the  assumption  of  equal  variances  may  not  be  unreasonable. 

(c)  We  can  use  the  usual  t .  test  command  together  with  the  option  alternative 
=  *  greater  *  to  get  the  solution. 


t . test (Theatre [Sex==l] , Theatre  [Sex==0] , 
alternative= ' greater ' ) 


Here,  p  is  much  smaller  than  0.001;  hence,  the  null  hypothesis  can  be  rejected. 
Women  spend  more  on  theatre  visits  than  men. 

(d)  We  deal  with  dependent  (paired)  data  because  different  variables  (expenditure 
this  year  versus  expenditure  last  year)  are  measured  on  the  same  observations. 
Thus,  we  need  to  use  the  paired  t-test — which  we  can  use  in  R  by  specifying 
the  paired  option: 


t . test (Theatre ,Theatre_ly ,paired=TRUE) 
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Paired  t-test 

data:  Theatre  and  Theatre_ly 

t  =  1.0925,  df  =  698,  p-value  =  0.275 

alternative  hypothesis:  true  difference  in  means  is  !=  0 
95  percent  confidence  interval: 

-2.481496  8.707533 

sample  estimates: 
mean  of  the  differences 

3.113019 

Both  the  p -value  (which  is  greater  than  0.05)  and  the  confidence  interval  (over¬ 
lapping  with  “0”)  state  that  the  null  hypothesis  should  be  kept.  The  mean  differ¬ 
ence  in  expenditure  (3.1  SFR)  is  not  large  enough  to  suggest  that  this  difference 
is  not  caused  by  chance.  Visitors  spend,  on  average,  about  the  same  in  the  two 
years. 


Solution  to  Exercise  10.11 


(a)  We  can  use  the  one- sample  t- test  to  answer  both  questions: 

t . test (temperature ,mu=65 , alternative= ' greater ' ) 
t . test (time ,mu=30 , alternative= ' less ' ) 


One-sample  t-test 

data:  temperature 

t  =  -11.006,  df  =  1265,  p-value  =  1 

alternative  hypothesis:  true  mean  is  greater  than  65 


data:  time 

t  =  23.291,  df  =  1265,  p-value  =  1 

alternative  hypothesis:  true  mean  is  less  than  30 


We  cannot  confirm  that  the  mean  delivery  time  is  less  than  30  min  and  that 
the  mean  temperature  is  greater  than  65  °C.  This  is  not  surprising:  we  have 
already  seen  in  Exercise  3.10  that  the  manager  should  be  unsatisfied  with  the 
performance  of  his  company. 

(b)  We  can  use  the  exact  binomial  test  to  investigate  Ho  :  p  >  0.15  and  H\  :  p  < 
0.15.  For  the  binom  .test  command,  we  need  to  know  the  numbers  of  successes 
and  failures,  i.e.  the  number  of  deliveries  where  a  free  wine  should  have  been 
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given  to  the  customer.  Applying  the  table  commands  yields  229  and  1037 
deliveries,  respectively. 

table (free_wine) 

binom. test (c (229 , 1037) ,p=0 . 15 , alternative= ' less ' ) 


Exact  binomial  test 
data:  c(229,  1037) 

number  of  successes  =  229,  number  of  trials  =  1266, 
p-value  =  0.9988  alternative  hypothesis:  true  probability 
is  less  than  0.15 
95  percent  confidence  interval: 

0.0000000  0.1996186 

sample  estimates,  probability  of  success:  0.1808847 


The  null  hypothesis  cannot  be  rejected  because  p  =  0.9988  >  a  =  0.05  and 
because  the  confidence  interval  covers  p  =  0.15.  We  get  the  same  results  if  we 
use  the  variable  “got_wine”  instead  of  “free_wine”.  While  the  test  says  that  we 
cannot  exclude  the  possibility  that  the  probability  of  receiving  a  free  wine  is 
less  than  15  %,  the  point  estimate  of  18  %  suggests  that  the  manager  still  has  to 
improve  the  timeliness  of  the  deliveries  or  stop  the  offer  of  free  wine. 

(c)  We  first  need  to  create  a  new  categorical  variable  (using  cut)  which  divides  the 
temperatures  into  two  parts:  below  and  above  65  °C.  Then  we  can  simply  apply 
the  test  commands  (fisher .test,  chisq.test,  prop. test)  to  the  table  of 
branch  and  temperature: 


hot  <-  cut (temperature ,breaks=c (-Inf , 65 , Inf ) ) 
fisher . test (table (hot , operator) ) 
chisq.test (table (hot , operator) ) 
prop . test (table (hot , operator) ) 


We  know  that  the  two  x2  tests  lead  to  identical  results.  For  both  of  them  the  p- 
value  is  0.2283  which  suggests  that  we  should  keep  the  null  hypothesis.  There 
is  not  enough  evidence  which  would  support  that  the  proportion  of  hot  pizzas 
differs  by  operator,  i.e.  that  the  two  variables  are  independent!  The  test  of  Fisher 
yields  the  same  result  ( p  =  0.2227). 

(d)  The  null  hypothesis  is  that  the  proportion  of  deliveries  is  the  same  for  each 
branch:  Ho  :  y>East  =  PWest  =  ^Centre-  To  test  this  hypothesis,  we  need  a  x2 
goodness-of-fit  test: 


chisq.test (table (branch) ) 
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Chi-squared  test  for  given  probabilities 
data:  table (branch) 

X-squared  =  0.74408,  df  =  2,  p-value  =  0.6893 

The  null  hypothesis  of  equal  proportions  is  therefore  not  rejected  (p  >  a  = 
0.05). 

(e)  To  compare  three  proportions  we  need  to  use  the  x2  homogeneity  test: 


prop . test (table (branch,  operator) ) 


X-squared  =  0.15719,  df  =  2,  p-value  =  0.9244 
alternative  hypothesis:  two. sided 
sample  estimates: 

prop  1  prop  2  prop  3 
0.5059382  0.5097561  0.4965517 

We  can  see  that  the  proportions  are  almost  identical  and  that  the  null  hypothesis 
is  not  rejected  (p  =  0.9244). 

(f)  To  test  this  relationship,  we  use  the  x2 -independence  test: 


chisq. test (table (driver ,  branch) ) 


X-squared  =  56.856,  df  =  8,  p-value  =  1.921e-09 
The  null  hypothesis  of  independence  is  rejected. 


Solution  to  Exercise  10.12  To  test  the  hypothesis  Pshalabh  =  PHeumann 
=  j?schomaker  =  1/3  we  use  the  x2  goodness-of-fit  test.  The  test  statistic  is 


Z 

i  =  l 


(Nj  -  npj)‘ 

npi 


1 


111 


[(110  -  111)2  +  (118  -  111)2  +  (105  -  111)2] 


(1  +  49  +  36)/lll  ss  0.77 


Ho  gets  rejected  if  x2  =  0.77  >  C3_i_o,i-o.oi  —  9.21.  Thus,  the  null  hypothesis  is 
not  rejected.  We  accept  that  all  three  authors  took  about  one-third  of  the  pictures. 
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Chapter  1 1 


Solution  to  Exercise  11.1 

(a)  Calculating  x  =  ^(26  +  23  +  27  +  28  +  24  +  25)  =  25.5  and  y  =  ^(170  + 
150  +  160+  175  +  155  +  150)  =  160,  we  obtain  the  following  table  needed 

/V 

for  the  estimation  of  a  and  f3: 


Body  mass  index 

Xj  Xi  -  X  (Xj  -  x)2 

Systolic  blood  pressure 

yt  yt  -  y  (+  -  y )2 

Vi 

26 

0.5 

0.25 

170 

10 

100 

5 

23 

-2.5 

6.25 

150 

-10 

100 

25 

27 

1.5 

2.25 

160 

0 

0 

0 

28 

2.5 

6.25 

175 

15 

225 

37.5 

24 

-1.5 

2.25 

155 

-5 

25 

7.5 

25 

-0.5 

0.25 

150 

-10 

100 

5 

Total  153 

17.5 

960 

550 

80 

With  Vi  =  (xi  —  x)  •  (yi  —  y)  =  80,  it  follows  that  Sxy  =  80.  Moreover, 
we  get  Sxx  =  X,  (xi  ~  x)2  =  17.5  and  Syy  =  -  y)2  =  550.  The  para- 

meter  estimates  are  therefore 


80 

VL5 


4.57, 


a  =  y-(3x  =  160  -  4.57  •  25.5  =  43.465. 


A  one-unit  increase  in  the  BMI  therefore  relates  to  a  4.57  unit  increase  in  the 
blood  pressure.  The  model  suggests  a  positive  association  between  BMI  and 
systolic  blood  pressure.  It  is  impossible  to  have  a  BMI  of  0;  therefore,  a  cannot 
be  interpreted  meaningfully  here. 

(b)  Using  (11.14),  we  obtain  R2  as 


Thus  66  %  of  the  data’s  variability  can  be  explained  by  the  model.  The  goodness 
of  fit  is  good,  but  not  perfect. 


410 


Appendix  B:  Solutions  to  Exercises 


Solution  to  Exercise  11.2 


A 

(a)  To  estimate  /3 ,  we  use  the  second  equality  from  (1 1.6): 

Ya= i  xi  yi  -nxy 


P  = 


Zn  7 

i=i  xf  ~  nx ' 

'ti 


Calculating  v  =  1.333,  y  =  5.556,  X/2=i  xi  yi  =  62.96,  ^”=1  xf  —  24.24, 
and  X/=i  l7?  —  281.5  leads  to 


^  62.91  -  9-1.333-5.556  -3.695 

M  24.24  -  9- 1.3332  8.248 

a  =  y  —  /3x  =  5.556  +  0.45  •  1.333  «  6.16, 
yi  —  6.16  —  0.45x/. 


For  children  who  spend  no  time  on  the  internet  at  all,  this  model  predicts  6.16  h 
of  deep  sleep.  Each  hour  spent  on  the  internet  decreases  the  time  in  deep  sleep 
by  0.45  h  which  is  27  min. 


(b)  Using  the  results  from  (a)  and  Syy  =  X!=i  yf  ~  ny2 

(— 3.695)2 


R2  =  r2  = 


S2 

^xy 


$XX  Syy 


8.248  •  3.678 


~  3.678  yields: 
«  0.45. 


About  45  %  of  the  variation  can  be  explained  by  the  model.  The  fit  of  the  model 
to  the  data  is  neither  very  good  nor  very  bad. 

(c)  After  collecting  the  data  in  two  vectors  (c  ()),  printing  a  summary  of  the  linear 
model  (summary  (lm() ) )  reproduces  the  results.  A  scatter  plot  can  be  produced 
by  plotting  the  two  vectors  against  each  other  (plot  () ).  The  regression  line  can 
be  added  with  abline  ( ) : 


it  <-  c(0.3,2.2, . . . ,2.3) 
sleep  <-  c(5.8,4.4, . . . ,6. 1) 
summary  (lm(sleep~it) ) 
plot (it , sleep) 
abline (a=6 . 16 ,b=-0 . 45) 


The  plot  is  displayed  in  Fig.  B.20. 

(d)  Treating  X  as  a  binary  variable  yields  the  following  values:  0,  1,0,  0,0,  1,  1,0,  1. 
We  therefore  have  v  =  0.444,  ^x2  =  4,  and  =  4.4  +  5.0  +  4.8  + 

6.1  =  20.3.  Since  the  T-values  stay  the  same  we  calculate 

a  _  X?=i  xi  yi  ~  n*y  _  20-3  -  9  •  5.556  •  0.444 

'  “  XLi  -  «-*2  ~  4  -  9  •  0.4442 


-0.85. 
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Fig.  B.20  Scatter  plot  and 
regression  line  for  the 
association  of  internet  use 
and  deep  sleep 


Thus,  those  children  who  are  on  the  internet  for  a  long  time  (i.e.  >1  h)  sleep  on 

/V 

average  0.85  h  (=51  min)  less.  If  we  change  the  coding  of  l’s  and  0’s,  /3  will 

A 

just  have  a  different  sign:  /3  =  0.85.  In  this  case,  we  can  conclude  that  children 
who  spend  less  time  on  the  internet  sleep  on  average  0.85  h  longer  than  children 
who  spend  more  time  on  the  internet.  This  is  the  same  conclusion  and  highlights 
that  the  choice  of  coding  does  not  affect  the  interpretation  of  the  results. 


Solution  to  Exercise  11.3 


(a)  The  correlation  coefficient  is 


r 


Sxy  _  170,  821  -  17- 166.65-60.12 

yj Syy  Sxx  7(62, 184  -  17  •  60.122)(472,  569  -  17  •  166.652) 


498.03 

7738.955-441.22 


0.87. 


This  indicates  strong  positive  correlation:  the  higher  the  height,  the  higher 
the  weight.  Since  R2  =  r2  =  0.872  ~  0.76,  we  already  know  that  the  fit  of 
a  linear  regression  model  will  be  good  (no  matter  whether  height  or  weight 

/V 

is  treated  as  outcome).  From  (11.11),  we  also  know  that  [3  will  be  positive. 


Weight 
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Fig.  B.21  Scatter  plot  and  regression  line  for  both  17  and  19  observations 


(b)  We  know  from  (a)  that  Sxy  =  498.03  and  that  Sxx  =  441.22.  The  least 
squares  estimates  are  therefore 


498.03 

441.22 


1.129, 


a  =  60.12  -  166.65  •  1.129  =  -128.03. 


Each  centimetre  difference  in  height  therefore  means  a  1.129  kg  difference 
in  weight.  It  is  not  possible  to  interpret  a  meaningfully  in  this  example. 

(c)  The  prediction  is 

-128.03  +  1.129  •  175  =  69.545  kg. 


(d)-(g)  The  black  dots  in  Fig.  B.21  show  the  scatter  plot  of  the  data.  There  is  clearly 
a  positive  association  in  that  greater  height  implies  greater  weight.  This  is 
also  emphasized  by  the  regression  line  estimated  in  (b).  The  two  additional 
points  appear  in  dark  grey  in  the  plot.  It  is  obvious  that  they  do  not  match  the 
pattern  observed  in  the  original  17  data  points.  One  may  therefore  speculate 

/V 

that  with  the  inclusion  of  the  two  new  points  f3  will  be  smaller.  To  estimate 
the  new  regression  line  we  need 

1 

x  =  —(17  •  166.65  +  150  +  175)  =  166.21, 
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This  yields 


Z"=i  xi  yi  -  nxy 

Z”=i  xf  ~  nx2 
«  0.28. 


191696  -  19  •  166.21  •  60.63  _  227.0663 
525694  -  19  •  166.21 2  “  804.4821 


This  shows  that  the  two  added  points  shrink  the  estimate  from  1 . 129  to  0.28. 
The  association  becomes  less  clear.  This  is  an  insightful  example  showing 
that  least  squares  estimates  are  generally  sensitive  to  outliers  which  can 
potentially  affect  the  results. 


Solution  to  Exercise  11.4 

(a)  The  point  estimate  of  (3  suggests  a  0.077  %  increase  of  hotel  occupation  for 
each  one  degree  increase  in  temperature.  However,  the  null  hypothesis  of  (3  =  0 
cannot  be  rejected  because  p  =  0.883  >0.05.  We  therefore  cannot  show  an 
association  between  temperature  and  hotel  occupation. 

(b)  The  average  hotel  occupation  is  higher  in  Davos  (7.9  %)  and  Polenca  (0.9  %) 
compared  with  Basel  (reference  category).  However,  these  differences  are  not 
significant.  Both  Ho  :  /feavos  =  0  and  Ho  :  /^Polenca  =  0  cannot  be  rejected. 
The  model  cannot  show  a  significant  difference  in  hotel  occupation  between 
Davos/Polenca  and  Basel. 

(c)  The  analysis  of  variance  table  tells  us  that  the  null  hypothesis  of  equal  average 
temperatures  in  the  three  cities  ( (3\  =  (32  =  0)  cannot  be  rejected.  Note  that  in 
this  example  the  overall  F- test  would  have  given  us  the  same  results. 

(d)  In  the  multivariate  model,  the  main  conclusions  of  (a)  and  (b)  do  not  change: 
testing  Ho  :  / 3j  =  0  never  leads  to  the  rejection  of  the  null  hypothesis.  We  cannot 
show  an  association  between  temperature  and  hotel  occupation  (given  the  city); 
and  we  cannot  show  an  association  between  city  and  hotel  occupation  (given  the 
temperature). 

(e)  Stratifying  the  data  yields  considerably  different  results  compared  to  (a)-(c):  In 
Davos,  where  tourists  go  for  skiing,  each  increase  of  1  °C  relates  to  a  drop  in  hotel 
occupation  of  2.7  %.  The  estimate  / 3  ~  —2.7  is  also  significantly  different  from 
zero  ( p  =  0.000231).  In  Polenca,  a  summer  holiday  destination,  an  increase  of 
1  °C  implies  an  increase  of  hotel  occupation  of  almost  4  %.  This  estimate  is  also 
significantly  different  from  zero  ( p  =  0.00114  <  0.05).  In  Basel,  a  business 
destination,  there  is  a  somewhat  higher  hotel  occupation  for  higher  temperatures 
(J3  =  1.3);  however,  the  estimate  is  not  significantly  different  from  zero.  While 
there  is  no  overall  association  between  temperature  and  hotel  occupation  (see 
(a)  and  (c)),  there  is  an  association  between  them  if  one  looks  at  the  different 
cities  separately.  This  suggests  that  an  interaction  between  temperature  and  city 
should  be  included  in  the  model. 
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(f)  The  design  matrix  contains  a  column  of  l’s  (to  model  the  intercept),  the  temper¬ 
ature  and  two  dummies  for  the  categorical  variable  “city”  because  it  has  three 
categories.  The  matrix  also  contains  the  interaction  terms  which  are  both  the 
product  of  temperature  and  Davos  and  temperature  and  Polenca.  The  matrix  has 
36  rows  because  there  are  36  observations:  12  for  each  city. 


1 

2 


Temp. 

-6 

-5 


Davos  Polenca  Temp,  x  Davos  Temp,  x  Polenca 

10-6  0 

10-5  0 


12  1 
13  1 


24  1 

25  1 

36  \  1 


0 

10 

12 

1 

4 


1 

0 


0 

1 


0 

0 


1 

0 


0 


0 


0 

0 

0 

0 

0 


0 

10 

12 

0 

0 


/ 


(g)  Both  interaction  terms  are  significantly  different  from  zero  (p  =  0.000375  and 
p  =  0.033388).  The  estimate  of  temperature  therefore  differs  by  city,  and  the 
estimate  of  city  differs  by  temperature.  For  the  reference  city  of  Basel,  the  associ¬ 
ation  between  temperature  and  hotel  occupation  is  estimated  as  1.31;  for  Davos 
it  is  1.31  —  4.00  =  —2.69  and  for  Polenca  1.31  +  2.66  =  3.97.  Note  that  these 
results  are  identical  to  (d)  where  we  fitted  three  different  regressions — they  are 
just  summarized  in  a  different  way. 


(h)  From  (f)  it  follows  that  the  point  estimates  for  temperature  are  1.31  for  Basel, 
—2.69  for  Davos,  and  3.97  for  Polenca.  Confidence  intervals  for  these  estimates 
can  be  obtained  via  (11.29): 

/V  /V 

(t  +  Pj)  ±  tn-p- l;l-a/2  ‘  Vfa+PjY 

We  calculate  tn-p- l-l-a/2  =  *36-5-1,0.975  =  *30,0.975  =  2.04.  With 
Var(temp.)  =  0.478  (obtained  via  0.69 162  from  the  model  output  or  from  the 
second  row  and  second  column  of  the  covariance  matrix),  Var(temp:Davos)  = 
0.997,  Var(/?p0ienCa)  =  1.43,  Cov(temp.  >  temp: Davos)  =  0.48,  and  also 

Cov(temp.  ?  temp:Polenca)  —  0.48  we  obtain: 


<J 


(Aemp.  +7d  avos) 


a 


(Acmp.  +  /^Polenca) 


a 


(Aemp.  +  A>asel) 


V0. 478  +  0.997  -  2  •  0.48  %  0.72, 
V0. 478  +  1.43  —  2  •  0.48  %  0.97, 
V0. 478  +  0  +  0  %  0.69. 
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The  95  %  confidence  intervals  are  therefore: 


Davos: 

Polenca: 

Basel: 


[-2.69  ±  2.04  •  0.72] 
[3.97  ±2.04- 0.97] 
[1.31  ±2.04-0.69] 


[-4.2;  -1.2], 
[2.0;  5.9], 
[-0.1;  2.7]. 


Solution  to  Exercise  11.5 


(a)  The  missing  value  [1]  can  be  calculated  as 


Pi  -  pi  _  0.39757  -  0 
Go  ~  0.19689 


2.019. 


Since  ^99-5-1,0.975  =  1-96  and  2.019  >  1.96,  it  is  clear  that  the  p-value 
from  [2]  is  smaller  than  0.05.  The  exact  p -value  can  be  calculated  in  R  via 
(1-pt  (2 .019 ,  693) )  *2  which  yields  0.0439.  The  pt  command  gives  the 
probability  value  for  the  quantile  of  2.019  (with  693  degrees  of  freedom): 
0.978.  Therefore,  with  probability  (1  —  0.978)  %  a  value  is  right  of  2.019  in 
the  respective  t -distribution  which  gives,  multiplied  by  two  to  account  for  a 
two-sided  test,  the  j9-value. 


(b)-(c)  The  plot  on  the  left  shows  that  the  residuals  are  certainly  not  normally  distrib¬ 
uted  as  required  by  the  model  assumptions.  The  dots  do  not  approximately 
match  the  bisecting  line.  There  are  too  many  high  positive  residuals  which 
means  that  we  are  likely  dealing  with  a  right- skewed  distribution  of  residu¬ 
als.  The  plot  on  the  right  looks  alright:  no  systematic  pattern  can  be  seen;  it 
is  a  random  plot.  The  histogram  of  both  theatre  expenditure  and  log(theatre 
expenditure)  suggests  that  a  log-transformation  may  improve  the  model, 
see  Fig.  B.22.  Log-transformations  are  often  helpful  when  the  outcome’s 
distribution  is  skewed  to  the  right. 

(d)  Since  the  outcome  is  log-transformed,  we  can  apply  the  interpretation  of  a 
log-linear  model: 


•  Each  year’s  increase  in  age  yields  an  exp(0.0038)  =  1.0038  times  higher 
(=0.38  %)  expenditure  on  theatre  visits.  Therefore,  a  10-year  age  dif¬ 
ference  relates  to  an  exp(10  •  0.0038)  =  1.038  times  higher  expenditure 
(=3.8  %). 

•  Women  (gender  =  1)  spend  on  average  (given  the  other  variables) 
exp (0.179)  ^1.20  times  more  money  on  theatre  visits. 

•  Each  1000  SFR  more  yearly  income  relates  to  an  exp(0.0088)  =  1.0088 
times  higher  expenditure  on  theatre  visits.  A  difference  in  10,000  SFR  per 
year  therefore  amounts  to  an  8.8  %  difference  in  expenditure. 

•  Each  extra  Swiss  Franc  spent  on  cultural  activities  is  associated  with  an 
exp(0. 00353)  =  1.0035  times  higher  expenditure  on  theatre  visits. 
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0  100  200  300  400  500 


expenditure  for  theatre  visits  (in  SFR) 
(a)  Expenditure 

Fig.  B  .22  Histogram  of  the  outcome 


o 


log-expenditure  for  theatre  visits  (in  SFR) 


(b)  log(expenditure) 


•  Except  for  theatre  expenditure  from  the  preceding  year,  all  (3j  are  signifi¬ 
cantly  different  from  zero. 


(e)  While  in  (b)  the  residuals  were  clearly  not  normally  distributed,  this  assump¬ 
tion  seems  to  be  fulfilled  now:  the  QQ-plot  shows  dots  which  lie  approxi¬ 
mately  on  the  bisecting  line.  The  fitted  values  versus  residuals  plot  remains 
a  chaos  plot.  In  conclusion,  the  log-transformation  of  the  outcome  helped 
to  improve  the  quality  of  the  model. 


Solution  to  Exercise  11.6 


(a)  The  multivariate  model  is  obtained  by  using  the  lm  ( )  command  and  separating 
the  covariates  with  the  +  sign.  Applying  summary  ( )  to  the  model  returns  the 
comprehensive  summary. 


mp  <-  lm(time  ~  temperature  +  branch  +  day  +  operator  +  driver 
+  bill  +  pizzas  +  discount_customer) 
summary (mp) 
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Estimate 

Std.  Error 

t  value 

PrOltl) 

(Intercept) 

40.42270 

2 . 00446 

20.166 

<  2e-16 

*** 

temperature 

-0.20823 

0.02594 

-8.027 

2.28e-15 

*** 

branchEast 

-1.60263 

0.42331 

-3.786 

0.000160 

*** 

branchWest 

-0.11912 

0.37330 

-0.319 

0.749708 

dayMonday 

-1.15858 

0.63300 

-1.830 

0.067443 

• 

daySaturday 

0.88163 

0.50161 

1.758 

0.079061 

• 

daySunday 

1.01655 

0.56103 

1.812 

0.070238 

• 

dayThursday 

0.78895 

0.53006 

1.488 

0.136895 

dayTuesday 

0.79284 

0.62538 

1.268 

0.205117 

dayWednesday 

0.25814 

0.60651 

0.426 

0.670468 

operatorMelissa 

-0.15791 

0.34311 

-0.460 

0.645435 

driverDomenico 

-2.59296 

0 . 73434 

-3.531 

0.000429 

*** 

driverLuigi 

-0.80863 

0.58724 

-1.377 

0.168760 

driverMario 

-0.39501 

0.43678 

-0.904 

0.365973 

driverSalvatore 

-0.50410 

0 . 43480 

-1.159 

0.246519 

bill 

0.14102 

0.01600 

8.811 

<  2e-16 

*** 

pizzas 

0.55618 

0.11718 

4.746 

2 . 31e-06 

*** 

discount .customer 

-0.28321 

0.36848 

-0.769 

0.442291 

Signif.  codes:  0 

***  0.001 

**  0.01  *  0 

.05  .  0. 

.1  1 

Residual  standard  error:  5.373  on  1248  degrees  of  freedom 
Multiple  R-squared:  0.3178,  Adjusted  R-squared:  0.3085 
F-statistic:  34.2  on  17  and  1248  DF,  p-value :  <  2.2e-16 

The  output  shows  that  lower  temperature,  higher  bills,  and  more  ordered  pizzas 
increase  the  delivery  times.  The  branch  in  the  East  is  the  fastest,  and  so  is 
the  driver  Domenico.  While  there  are  differences  with  respect  to  day,  discount 
customers,  and  the  operator,  they  are  not  significant  at  the  5  %  level. 

(b)  The  confidence  intervals  are  calculated  as:  ±  tn-p-\- \-a/2  •  We  know 
from  the  model  output  from  (a)  that  there  are  1248  degrees  of  freedom  (1266 
observations  —  18  estimated  coefficients).  The  respective  quantile  from  the  t- 
distribution  is  obtained  with  the  qt  ( )  function.  The  coefficients  are  accessed  via 
the  coefficients  command  (alternatively:  mp$coef  f  icients);  the  variances 
of  the  coefficients  are  either  accessed  via  the  diagonal  elements  of  the  covariance 
matrix  (diag(vcov(mp) ))  or  the  model  summary 
(summary  (mp)  [[4]]  [,2]) — both  of  which  are  laborious.  The  summary  of 
coefficients,  lower  confidence  limit  (lcl),  and  upper  confidence  limit  (ucl)  may 
be  summarized  in  a  matrix,  e.g.  via  merging  the  individual  columns  with  the 
cbind  command. 
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lcl  <-  coefficients (mp)  -  qt (0 . 975 , 1248) *sqrt (diag(vcov(mp) ) ) 
ucl  <-  coefficients (mp)  +  qt (0 . 975 , 1248) *sqrt (diag(vcov(mp) ) ) 
cbind(coef f icients (mp) , lcl, ucl) 


(Intercept) 

temperature 

branchEast 

branchWest 

dayMonday 


lcl  ucl 
40.4227014  36.4902223  44.3551805 
-0.2082256  -0.2591146  -0.1573366 
-1.6026299  -2.4331162  -0.7721436 
-0.1191190  -0.8514880  0.6132501 
-1.1585828  -2.4004457  0.0832801 


(c)  The  variance  is  estimated  as  the  residual  sum  of  squares  divided  by  the  degrees 
of  freedom,  see  also  (11.27).  Applying  the  residuals  command  to  the  model 
and  using  other  basic  operations  yields  an  estimated  variance  of  28.86936. 


sum  (residuals  (mp)~2)  /  (mp$df  .  residual) 


Taking  the  square  root  of  the  result  yields  V28. 86936  =  5.37  which  is  also 
reported  in  the  model  output  from  (a)  under  “Residual  standard  error”. 

(d)  The  sum  of  squares  error  is  defined  as  X/Li  (j*  ~  9 i)2-  The  total  sum  of  squares 
is  X?=i  (yi  ~  y)2-  This  can  be  easily  calculated  in  R.  The  goodness  of  fit  is  then 
obtained  as  R2  =  1  —  S  Qeyyoy  /  ^  Qjotal  —  0.3178.  Dividing  S  gError  by  n  —  p  — 
1  =  1266  —  17  —  1  =  1248  and  SQxotai  by  n  —  1  =  1265  yields  R^  =  0.3085. 
This  corresponds  to  the  model  output  from  (a). 


SQE  <-  sum  (residuals  (mp)~2) 

SQT  <-  sum ( (time-mean (time)  )~2) 
1- (SQE/SQT) 

1- ( (SQE/ 1248) / (SQT/1265) ) 


(e)  Applying  stepAIC  to  the  fitted  model  (with  option  “back”  for  backward  selec¬ 
tion)  executes  the  model  selection  by  means  of  AIC. 


library (MASS) 

stepAIC (mp,  direction= ' back 1 ) 
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The  output  shows  that  the  full  model  has  an  AIC  of  4275.15.  The  smallest  AIC 
is  achieved  by  removing  the  operator  variable  from  the  model. 


Step:  AIC=4275 . 15 

time  ~  temperature  +  branch  +  day  +  operator  +  driver  +  bill  + 
pizzas  +  discount_customer 


Df 

Sum  of  Sq 

RSS 

AIC 

-  operator 

1 

6.11 

36035 

4273.4 

-  discount_customer 

1 

17.05 

36046 

4273.8 

<none> 

36029 

4275.2 

"  day 

6 

448.79 

36478 

4278.8 

-  driver 

4 

363.91 

36393 

4279.9 

-  branch 

2 

511.10 

36540 

4289.0 

-  pizzas 

1 

650.39 

36679 

4295.8 

-  temperature 

1 

1860.36 

37889 

4336.9 

-  bill 

1 

2241.30 

38270 

4349 . 6 

The  reduced  model  has  an  AIC  of  4273.37.  Removing  the  discount  customer 
variable  from  the  model  yields  an  improved  AIC  (4272.0  <  4273.37). 

Step:  AIC=4273 . 37 

time  ~  temperature  +  branch  +  day  +  driver  +  bill  +  pizzas  + 
discount  customer 


Df 

Sum  of  Sq 

RSS 

AIC 

-  discount_customer 

1 

17.57 

36053 

4272.0 

<none> 

36035 

4273.4 

"  day 

6 

452.00 

36487 

4277.1 

-  driver 

4 

364.61 

36400 

4278.1 

-  branch 

2 

508.57 

36544 

4287.1 

-  pizzas 

1 

649.54 

36685 

4294.0 

-  temperature 

1 

1869.98 

37905 

4335.4 

-  bill 

1 

2236.19 

38271 

4347 . 6 

The  model  selection  procedure  stops  here  as  removing  any  variable  would  only 
increase  the  AIC,  not  decrease  it. 


Step:  AIC=4271 . 98 

time  ~  temperature  +  branch  +  day  +  driver  +  bill  +  pizzas 


<none> 

"  day 

-  driver 

-  branch 

-  pizzas 

-  temperature 

-  bill 


Df  Sum  of  Sq  RSS  AIC 

36053  4272.0 
6  455.62  36508  4275.9 

4  368.18  36421  4276.8 

2  513.17  36566  4285.9 

1  657.07  36710  4292.8 

1  1878.24  37931  4334.3 

1  2228.88  38282  4345.9 


Standardized  residuals 
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The  final  model,  based  on  backward  selection  with  AIC,  includes  day,  driver, 
branch,  number  of  pizzas  ordered,  temperature,  and  bill. 

(f)  Fitting  the  linear  model  with  the  variables  obtained  from  (e)  and  obtaining  the 
summary  of  it  yields  an  R ^  of  0.3092. 


mps  <-  lmCtime  ~  temperature  +  branch  +  day  +  driver  +  bill  + 

pizzas) 

summary (mps) 


This  is  only  marginally  higher  than  the  goodness  of  fit  from  the  full  model 
(0.3092  >  0.3085).  While  the  selected  model  is  better  than  the  model  with  all 
variables,  both,  with  respect  to  AIC  and  R^d y,  the  results  are  very  close  and 
remind  us  of  the  possible  instability  of  applying  automated  model  selection 
procedures. 

(g)  Both  the  normality  assumption  and  heteroscedasticity  can  be  checked  by  apply¬ 
ing  plotO  to  the  model.  From  the  many  graphs  provided  we  concentrate  on 
the  second  and  third  of  them: 


plot (mps,  which=2) 
plot (mps,  which=3) 


Figure  B.23a  shows  that  the  residuals  are  approximately  normally  distributed 
because  the  dots  lie  approximately  on  the  bisecting  line.  There  are  some  smaller 
deviations  at  the  tails  but  they  are  not  severe.  The  plot  of  the  fitted  values  versus 


Theoretical  Quantiles 

lm(time  ~  temperature  +  branch  +  day  +  driver  +  bill  +  pizzas) 

(a)  QQ-plot 


o 

c\j 


20  25  30  35  40  45 

Fitted  Values 

(b)  Fitted  values  vs.  residuals 


Fig.  B  .23  Checking  the  model  assumptions 
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the  square  root  of  the  absolute  values  of  the  residuals  shows  no  pattern;  it  is  a 
random  plot  (Fig.  B.23b).  There  seems  to  be  no  heteroscedasticity. 

(h)  Not  all  variables  identified  in  (e)  represent  necessarily  a  “cause”  for  delayed  or 
improved  delivery  time.  It  makes  sense  to  speculate  that  because  many  pizzas  are 
being  delivered  (and  need  to  be  made!)  the  delivery  time  increases.  There  might 
also  be  reasons  why  a  certain  driver  is  improving  the  delivery  time:  maybe  he  does 
not  care  about  red  lights.  This  could  be  investigated  further  given  the  results  of 
the  model  above.  However,  high  temperature  does  not  cause  the  delivery  time  to 
be  shorter;  likely  it  is  the  other  way  around:  the  temperature  is  hotter  because  the 
delivery  time  is  shorter.  However,  all  of  these  considerations  remain  speculation. 
A  regression  model  only  exhibits  associations.  If  there  is  a  significant  association, 
we  know  that  given  an  accepted  error  (e.g.  5  %),  values  of  x  are  higher  when 
values  of  y  are  higher.  This  is  useful  but  it  does  not  say  whether  x  caused  y  or 
vice  versa. 

(i)  To  check  whether  it  is  worth  to  add  a  polynomial,  we  simply  add  the  squared 
temperature  to  the  model.  To  make  R  understand  that  we  apply  a  transformation 
we  need  to  use  I  ( ) . 


mps2  <-  lm(time  ~  temperature  +  I  (temperature~2)  + 

I  (temperature~3)  +  branch.  +  day  +  driver  +  bill  +  pizzas) 
summary (mps2) 


Estimate  Std.  Error  t  value  Pr ( > I t I ) 


(Intercept)  -18.954965 
temperature  1.736692 
I (temperature~2)  -0.015544 
branchEast  -1.429772 


8.795301  -2.155  0.03134  * 

0.282453  6.149  1.05e-09  *** 

0.002247  -6.917  7.36e-12  *** 

0.416107  -3.436  0.00061  *** 


It  can  be  seen  that  the  null  hypothesis  Ho  :  /3temp2  =  0  is  rejected.  This  indicates 
that  it  is  worthwhile  to  assume  a  quadratic  relationship  between  temperature  and 
delivery  time. 

(j)  The  prediction  can  be  obtained  by  the  predict  command  as  follows: 


predict (mps , pizza  [1266 , ] ) 


The  prediction  is  36.5  min  and  therefore  0.8  min  higher  than  the  real  delivery 
time. 
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More  details  on  Chap.  3 


Proof  of  equation  (3.27) 

k 


=  \  Z  z _  *)2  =  “  Z  Z  {x‘  -  + 7xj  - 

ii/  *  1/ 


7  =  1 
k 


7=1 
k 


=  ^Z  Z^-^>2+^Z  Z 


7  =  1  ■*i€Ar; 


7  =  1 


"V*" 

m 


[»i 


+  -  Z  Z  to  -  */)©  _  *) 


7  =  1  *i€^7 


— V — 

[/  /  /  ] 


We  obtain  the  following  expressions  for  [/]-[///]: 

[‘•i—iz1  Zfe-^)2=1z 

n  r  J  J  VI  .  <  J  J  yi  <  J 


n  * — :  n  j 

7  =  1  •'  XitKj 


l  k 

[/ ;  ]  =  —  7  n  ;  (x ,  —  x)2  , 
n  1 


ni's )  > 


7  =  1 
& 


[iii]  =  -  ZX-r;  -  *)  Z  (*'  -  XJ =  2  Z(x7  -  *)  0  =  0. 

7  =  1  *^£7  7  =  1 

Since  [/]  is  the  within-class  variance  and  [ii]  is  the  between-class  variance,  Eq.  (3.27) 
holds. 
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More  details  on  Chap.  7 


Proof  of  Theorem  1.2.3.  Consider  the  interval  (xo  —  S,  xo]  with  S  >  0.  From  (7.12) 
it  follows  that  P(xo  —  5  <  X  <  xo)  =  F(xo)  —  F(x o  —  (5)  and  therefore 

P(X  =  xo)  =  lim  P(xo  —  S  <  X  <  xo) 

6^0 

=  lim[F(x0)  -  F (x0  -  S)] 

5^0 

=  F(xo)  -  F(xo)  =  0. 

Proof  of  Theorem  7.3.1. 


Var(X)  (7=?)  E(X  -  n)2 

=  E(X2  -  2/iX  +  i?) 

(7.28-7.31)  T7sV2\  o  ev  v\  i  cv  2\ 

=  E(X^)  —  2  pE(X)  +  E{(r) 

=  E(X2)  -  2/j2  +  p 

=  E(X2)  -  [£(X)]2. 


Proof  of  Theorem  7.4.1.  We  define  a  discrete  random  variable  Y  as 


0  if  \X  —  fi\  <  c 
c 2  if  \X  —  fi\  >  c. 


(C.l) 


The  respective  probabilities  are  P(\X  —  (i\  <  c )  =  p\  and  P(\X  —  p\  >  c)  =  p2 . 
The  definition  of  T  in  (C.l)  implies  that 

Y  <  \X  —  p\2 

since  for  \X  —  fi\2  <  c2  Y  takes  the  value  y\  =  0  and  therefore  Y  <  \X  —  fi\2 .  If 
\X  —  fi\2  >  c2Y  takes  the  value  y2  =  c2,  and  therefore  Y  <  \X  —  p\2.  Using  this 
knowledge,  we  obtain 

E(F)  <  E(X  -  ii)2  =  Var(X). 


However,  for  Y  we  also  have 

E(T)  =  0  •  p\  +  c2  •  p2  =  c2P(\X  -  fi\  >  c) 
which  means  that  we  can  summarize  the  above  findings  in  the  following  inequality: 

c2P(\X  -n\  >c)<  Var(X). 

This  equates  to  Tschebyschev’s  inequality.  Using  P(A)  =  1  —  F(A),  i.e. 

P(\X  —  p\  >  c)  =  1  —  P(\X  —  p\  <  c), 

we  obtain  the  second  formula  of  Tschebyschev’s  inequality: 

Var(X) 

P(\X-p\<c)>l - 
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Proof  of  rule  (7.30).  For  discrete  variables,  we  have 

E (a  +  bX)  (7=  +  bxj )/?,  =  ^  ap;  +  fox,  pt  =  a  ^  /),  +  b  ^  xt  pi . 

i  i  i 

Since  pi  =  1  and  JL  x; pi  =  E(X),  we  obtain  E(<2  +  £>X)  =  a  +  Z?E(X),  which 
is  rule  (7.30).  In  the  continuous  case,  we  have 

/OO  /*00 

(a  +  Z?v)/(v)dv  =  /  (af(x)dx  +  Z?v/(v)dv) 

-oo  7  —  oo 

/OO  1*00 

/(x)dx  +  b  /  v/(v)dv  =  a  +  Z?E(X). 

-oo  7  —  00 

Proof  of  rule  (7.33).  Using  Var(X)  =  E(  X2 )  —  E(X)2,  we  can  write  the  variance  of 
bX  as 

Var  (bX)  =  E  ([bX]2)  +  E  (bX)2. 

Using  (7.29),  we  get  E ([bX]2)  =  b2 E(X2)  and  E(Z>X)2  =  (Z>E(X))2.  Therefore 

Var  (bX)  =  b2(E(X2)  -  E(X)2)  =  b2V ar(X). 


Proof  of  p  =  l  for  a  perfect  linear  relationship.  If  Y  =  aX  +  h  with  a  7^  0,  we  get 

Cov(X,  Y)  =  E[(X  -  px)(Y  -  pY)] 

=  aE[(X  —  px)(X  ~  hx)] 

because  Co\(aX  +  h,  cY  +  d)  =  ac  Cov(X,  E),  see  p.  147.  Then 

Cov(X,  Y)  =  aVar(X), 

Var(T)  (7=3)  a2Var(X), 


and  therefore 


P(X9  Y )  = 


aW ar(X) 


-/  a2  Var(X)  Var(X)  l« 


a 


if  a  >  0.  Similarly,  if  7  =  aX  +  Z?  with  a  <  0  we  get  p(X,  Y)  =  —  1 


More  details  on  Chap.  8 


Theorem  of  Large  Numbers.  To  explain  the  Theorem  of  Large  Numbers,  we  first 
need  to  first  define  stochastic  convergence. 

Definition  C.l  A  sequence  of  random  variables,  (Xn)nef$,  converges  stochastically 
to  0,  if  for  any  e  >  0 

lim  P( \Xn\  >  e)  =  0  (C.2) 

n — >•  00 


holds. 
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This  is  equivalent  to  lim^oo  P(\Xn\  <  e)  =  1. 


Theorem  C.l  (Theorem  of  large  numbers)  Consider  n  i.i.d.  random  variables 
X\,  X2,  . . . ,  Xn  with  E(X;)  =  /i,  Var(X;)  =  a2  and  Xn  =  ^  Xi-  It  holds  that 

lim  P( \Xn  —  /jl\  <  c)  =  1,  V  c  >  0.  (C.3) 

n^oo 


This  implies  that  Xn  converges  stochastically  to  /a.  As  another  motivation,  recall 
Definition  7.6.1  where  we  define  random  variables  X\ ,  X2,  . . . ,  Xn  to  be  i.i.d.  (inde¬ 
pendently  identically  distributed)  if  all  X/  follow  the  same  distribution  and  are  inde- 

2 

pendent  of  each  other.  Under  this  assumption,  we  showed  in  (7.36)  that  Var(X)  =  . 
It  follows  that  the  larger  n ,  the  smaller  the  variance.  If  we  apply  Tschebyschev’s 
inequality  (Theorem  7.4.1)  to  X,  we  get  the  following  equation  for  (Xn  —  (i)ne 

P(\Xn  -fjL\<c)>l - =  1  -  (C.4) 

cz  ncz 

This  means  that  for  each  c  >  0,  the  right-hand  side  of  the  above  equation  tends  to  1 
as  n  — >  00  which  gives  a  similar  interpretation  as  the  Theorem  of  Large  Numbers. 


Central  Limit  Theorem.  Let  X/  (f  =  1,2,...,  n)  be  n  i.i.d.  random  variables 
with  E(Xj)  =  fi  and  Var(X/)  =  a2.  If  we  consider  the  sum  Xi,  we  obtain 
E(2/=i  Xi)  =  nF  and  Var(52”=i  X/)  =  n a2.  If  we  want  to  standardize  Yi= 1  ^ i 
we  can  use  Theorem  7.3.2  to  obtain 


—  ft/i 


(C.5) 


i.e.  it  holds  that  E(7W)  =  0  and  Var(F^)  =  1. 


Theorem  C.2  (Central  Limit  Theorem)  Let  X/  (i  =  1,2,  ...  ,n)  be  n  i.i.d.  random 
variables  with  E (Xj)  =  fj,  and  Var (X/)  =  cr2.  Yn  denotes  the  standardized  sum  of 
Xi ,  i  =  1 ,  2,  . . . ,  n.  The  CDF  of  Yn  is 

lim  P(Yn  <y)  =  V  y, 

n^oo 

where  d>{y)  denotes  the  CDF  of  the  standard  normal  distribution. 


This  theorem  tells  us  that  Yn  is  approximately  standard  normal  distributed  if  n  is 
large,  i.e. 


Yn  ~  N( 0,  1)  as  n  00. 


This  is  equivalent  to  saying  that  1  Xi  *s  approximately  N(nfi,  na2)  distributed 
if  n  is  large,  i.e. 

n 

Xi  ~  N(n/i,  na2). 
i  —  \ 


(C.6) 
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As  a  consequence  Xn 


1 

n  =  \ 


Xi  is  N(jjl, 


^-)  distributed  for  large  n ,  i.e. 


In  summary,  the  Theorem  of  Large  Numbers  tells  us  that  Xn  converges  stochas¬ 
tically  to  /i,  whereas  the  Central  Limit  Theorem  tells  us  that  Xn  converges  to  a 

2 

N(fi,  ^-) -distribution  as  n  tends  to  infinity. 


PDF  of  the  x2  -Distribution.  The  PDF  of  the  x2 -distribution,  with  n  degrees  of 
freedom,  is  defined  as 


/(*)  = 


xn/2  1  exp(— x/2) 
r  (n/2)2n/2 

0 


if  v  >  0 
otherwise. 


(C.7) 


Note  that  r(n)  is  the  Gamma  function,  defined  as  r in)  =  ( n  —  1)!  for  positive  inte¬ 
gers  and  r(x)  =  /q°°  tx~l  exp(— t)dt  otherwise. 


PDF  of  the  /  -Distribution.  The  PDF  of  the  /  -distribution,  with  n  degrees  of  freedom, 
is  defined  as 


fix)  = 


r(*&) 


r(n/2)y/mr 


K) 


■(»+l)/2 


—  00  <  V  <  oo. 


(C.8) 


Note  that  r(n)  is  the  Gamma  function,  defined  as  r(n)  =  (n  —  1)!  for  positive  inte¬ 
gers  and  r(x)  =  /0°°  tx~l  exp(— t)dt  otherwise. 


PDF  of  the  F -Distribution.  The  PDF  of  the  F -distribution,  with  n  and  m  degrees 
of  freedom,  respectively,  is  defined  as 


fix)  = 


p/n+m\(  n  \n/2  n/2—\ 
_ 2  2  '_rn_ _ 

r(|)r(f)(l  +  M)(/i+m)/2’ 


v  >  0. 


(C.9) 


The  PDF  of  the  F -distribution  with  m  and  n  degrees  of  freedom  can  be  derived  by 
interchanging  the  roles  of  m  and  n. 


More  details  on  Chap.  9 


Another  Example  of  Sufficiency.  Let  X\,  X2, . . . ,  Xn  be  a  random  sample  from 
N(fi,  a2)  where  /x  and  a2  are  both  unknown.  We  attempt  to  find  a  sufficient  statistic 
for  fi  and  a2. 
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f(xi,X2-..,xn\ii,a2)  = 


1  -A,  (Xj  -  n )2 

2  Z-*  a- 

i  =  1 


Here  the  joint  density  depends  on  xi,x2,-..,xn  through  two  statistics 
h(xi,x2,  ...,xn)  =  Y!i=ixi  and  t2(x  i,  x2, . . . ,  xn)  =  with  h(x  i,x2, 

. . .  ,xn)  =  1.  Thus  T\  =  X-=i  and  T2  =  X2  are  jointly  sufficient  for  fi 
and  cr2.  Therefore,  X  and  .S'2  =  ^-j-  2^=1(X;  —  X)2  are  jointly  sufficient  for  p  and 

a2  as  they  are  a  one-to-one  function  of  T\  and  T2.  They  are  the  maximum  likelihood 
estimates  for  p  and  a2. 


More  details  on  Chap.  10 


Exact  Test  of  Fisher.  Similar  to  the  approximate  two-sample  binomial  test  in 
Sect.  10.4.2,  we  assume  two  samples  following  a  binomial  distribution  with  para¬ 
meters  (ni,  p\)  and  ( n2 ,  p2),  respectively. 

X  =  (XuX2,...,Xni),  Xi~B(V,p  i) 

Y  =  (Ti,  Y2, . . . ,  Yn2),  Yi~B(l;p2). 

For  the  sums  of  these  random  variables,  we  get: 

n\  n2 

x  =  Y,xi  ~  s(m;  pi),  J7  =  ZA  ~  pi). 

i  =  1  i  =  1 

Let  Z  =  X  +  T.  The  Exact  Test  of  Fisher  uses  the  fact  that  the  row  marginal  fre¬ 
quencies  n\  and  n2  in  the  following  table 


Success 

Failure 

Total 

Population  A 

X 

m  -  X 

n\ 

Population  B 

Z  —  X  =  Y 

n2  —  (Z  —  X) 

n2 

Total 

Z 

(n\  +n2  -  Z) 

n\  +  n2 

are  fixed  by  the  sample  sizes  n\  and  n2 .  Conditional  on  the  total  number  of  successes 
Z  =  z  (i.e.  the  column  margins  are  assumed  to  be  fixed),  the  only  remaining  random 
variable  is  X  (since  the  other  three  entries  of  the  table  are  then  determined  by  the 
realization  v  of  X  and  the  margins).  Under  Hq  :  p\  =  p2  =  p,  it  can  be  shown  that 


X  ~  H{n ,  ni,z), 
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i.e. 

p  (X  =  x\Z  =  z)  = 

\Z 

The  proof  is  straightforward  using  the  idea  of  conditioning  on  Z  —  z  and  the  assump¬ 
tion  of  independence  of  the  two  samples: 


P(X  =  x\Z  =  z) 


P(X  =  x,Z  =  z) 
P{Z  —  z) 


P(X  =  x,Y  =z-x) 

P(Z  =  z) 

P(X  =x)P(Y  =  z-x) 

P(Z  =  z) 

(“V(  1  -  p)ni~X{znZx)pZ-X(  1  -  P)',2-(Z^) 

Q)pza-p)n~z 

CHAO  Od”1) 

G)  (”)  ’ 


Note  that  in  the  equation  above  we  use  the  fact  that  under  Ho,  Z  =  X  +  Y  is  B(n,  p) 
distributed;  see  the  additivity  theorem  for  the  binomial  distribution,  i.e.  Theorem 


8.1.1. 


Example  C.  1  Consider  two  competing  lotteries  A  and  B .  Say  we  buy  10  tickets  from 
each  lottery  and  test  the  hypothesis  of  equal  winning  probabilities.  The  data  can  be 
summarized  in  a  2  x  2  table: 


Winning  Not  winning 

Total 

Lottery  A 
Lottery  B 

1  24 

7  18 

25 

25 

Total 

8  42 

50 

In  R,  we  can  use  the  command  f  isher .  test  to  perform  the  test.  Using  the  exam¬ 
ple  data  and  H\  :  p\  ^  p2,  we  get 


ft  <-  matrix (nrow=2 ,ncol=2 , data=cbind(c (1 , 7) ,  c(24,18))) 
fisher .test (x=f t) 


with  output 


Fisher’s  Exact  Test  for  Count  Data 

data:  fisher. table 

p-value  =  0 . 0488 

alternative  hypothesis:  true  odds  ratio  is  not  equal  to  1 


430 


Appendix  C:  Technical  Appendix 


95  percent  confidence  interval: 

0.002289885  0.992114690 
sample  estimates: 
odds  ratio 
0.1114886 

For  the  example  data  and  a  =  0.05,  the  null  hypothesis  is  rejected,  since  the  p -value 
is  lower  than  a.  For  the  calculation  of  the  p-  value,  the  one-sided  and  two-sided  cases 
have  to  be  distinguished.  The  idea  is  that  while  fixing  the  margins  at  the  observed 
values  (i.e.  25,  25,  8,  42),  we  have  to  calculate  the  sum  of  the  probabilities  of  all 
tables  which  have  lower  probability  than  the  observed  table.  In  R ,  one  can  use  the 
functions  dhyper  and  phyper  for  calculating  (cumulative)  probabilities.  For  exam¬ 
ple,  P(X  =  1 1 Z  =  8)  can  be  calculated  as 


dhyper (1,25,25,8) 


[1]  0.02238402 

A  more  extreme  table  than  the  observed  one  is 


0  25 

8  17 

25 

25 

8  42 

50 

with  probability  P(X  =  0)  =  dhyper  (0,25,25,8)  =  0.002,  which  is  lower  than 
P(X  =  1).  The  sum  is  0.0224  +  0.002  =  0.0244  which  is  the  (left)  one-sided  p- 
value.  In  this  case  (not  generally  true!),  the  two-sided  y>-value  is  simply  two  times 
the  one-sided  value,  i.e.  2  •  0.0244  =  0.0488. 


Remark  C.l  The  two-sided  version  of  the  Exact  Test  of  Fisher  can  also  be  used  as 
a  test  of  independence  of  two  binary  variables.  It  is  equivalent  to  the  test  of  equality 
of  two  proportions,  see  Example  10.8.2. 


One-Sample  %2-Test  for  Testing  Hypothesis  About  the  Variance.  We  assume  a 
normal  population,  i.e.  X  ~  N(p,  a2)  and  an  i.i.d.  sample  (X\ ,  X2,  . . . ,  Xn)  distrib¬ 
uted  as  X.  We  only  introduce  the  test  for  two-sided  hypotheses 

H0  :  cr2  =  ctq 
Hi  :  a2  /  ctq. 


versus 
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The  test  statistic 


T(X)  = 


(n  -  1  )S 


x 


cr, 


0 


follows  a  Xn_i -distribution  under  Ho.  The  critical  region  is  constructed  by  taking 
the  a/ 2-  and  (1  —  a/ 2)  quantile  as  critical  values;  i.e.  Ho  is  rejected,  if 


t(x)  <  Cn—l\a/ 2 


or  if 


t(x)  >  Cn— 1 ;  1 — a  I'll 


where  cn-\-a/2  and  cn- i-i-a/2  are  the  desired  quantiles  of  a  x2 -distribution.  In  R , 
the  test  can  be  called  by  the  sigma  .test  function  in  the  TeachingDemos  library  or 
the  varTest  function  in  library  EnvStats.  Both  functions  also  return  a  confidence 
interval  for  the  desired  confidence  level.  Note  that  the  test  is  biased.  An  unbiased 
level  a  test  would  not  take  a/ 2  at  the  tails  but  two  different  tail  probabilities  a\  and 
ol2  with  a\  +  ol2  =  ol. 


F-Test  for  Comparing  Two  Variances.  Comparing  variances  can  be  of  interest 
when  comparing  the  variability,  i.e.  the  “precision”  of  two  industrial  processes;  or 
when  comparing  two  medical  treatments  with  respect  to  their  reliability.  Consider 
two  populations  characterized  by  two  independent  random  variables  X  and  Y  which 
follow  normal  distributions: 

X~N(jjtx,(rx),  Y  ~N(jjiy,o$). 

For  now,  we  distinguish  the  following  two  hypotheses: 

Ho  :  =  <Jy  versus  :  a\  /  <jy,  two-sided 

Ho  :  <  cry  versus  H\  :  a\  >  one-sided. 

The  third  hypothesis  with  H\  :  o\  <  is  similar  to  the  second  hypothesis  where 
X  and  Y  are  replaced  with  each  other. 

Test  Statistic 


Let  (X\ ,  X2,  . . . ,  Xni )  and  (Y\ ,  Y2,  . . . ,  Yni)  be  two  independent  random  samples 
of  size  n\  and  n 2.  The  test  statistic  is  defined  as  the  ratio  of  the  two  sample  variances 


T  (X,  Y) 


which  is,  under  the  null  hypothesis,  F-distributed  with  n\ 
of  freedom,  see  also  Sect.  8.3.3. 


(C.10) 


1  and  ^2  —  1  degrees 
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Critical  Region 

Two-Sided  Case.  The  motivation  behind  the  construction  of  the  critical  region 
for  the  two-sided  case,  Ho:  cr\  =  vs.  H\ :  7^  ,  is  that  if  the  null  hypothesis 

is  true  (i.e.  the  two  variances  are  equal)  then  the  test  statistic  (C.10)  would  be  1; 
also,  T (X,  Y)  >  0.  Therefore,  very  small  (but  positive)  and  very  large  values  of 
T(X,  Y)  should  lead  to  a  rejection  of  Ho.  The  critical  region  can  then  be  written  as 
K  =  [0,  £1)  U  (&2,  00),  where  k\  and  k 2  are  critical  values  such  that 

P(T (X,  Y)  <  k\\Ho)  =  a/2 
P(T(X,  Y)  >k2\H0)  =  a/2. 

Here  k\  and  k 2  can  be  calculated  from  the  quantile  function  of  the  F -distribution  as 


kl  —  fn\  —  —  1  ,oj/2  ?  ^2  —  fn\  — 1,722  — 1,1— cm/2- 


Example  C.2  Let  =  50,  ft  2  =  60  and  a  =  0.05.  Using  the  qf  command  in  R ,  we 
can  determine  the  critical  values  as: 

qf (q=0 . 025 ,  df 1=50-1,  df 2=60-1) 
qf (q=0 . 975 ,  df 1=50-1,  df 2=60-1) 


The  results  are  k\  =  0.5778867  and  k 2  =  1.706867. 


Remark  C.  2  There  is  the  following  relationship  between  quantiles  of  the  F  -distribution: 


l;a/2 


1 


l;l-a/2 


One-Sided  Case.  In  the  one-sided  case,  the  alternative  hypothesis  is  always  for¬ 
mulated  in  such  a  way  that  the  variance  in  the  numerator  is  greater  than  the  vari¬ 
ance  in  the  denominator.  The  hypotheses  are  Ho :  <7%  <  <Jy  versus  Hi  :  g\  >  <j\ 
or  Ho'.  v\lci\  <  1  versus  H\  \  cr\/( >  1.  This  means  it  does  not  matter  whether 
H\  :  o\  >  (jy  or  Hi  :  ;  by  constructing  the  test  statistic  in  the  correct  way, 

we  implicitly  specify  the  hypothesis.  The  critical  region  K  consists  of  the  largest 
values  of  T(X,  Y),  i.e.  K  =  (k,  00),  where  k  has  to  fulfil  the  condition 

P(T(X,  Y)  >  k\Ho)  =  a. 

The  resulting  critical  value  is  denoted  by  k  =  fni- i;n2_i;i_a. 
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Observed  Test  Statistic 


Using  the  sample  variances,  the  realized  test  statistic  t  is  calculated  as: 


t(x,  y)  =  S2X  = 


y 


1  n\ 

— y  (xi  - 
ni- 

/  — l 


1 


-\2  2 

x)  »  ^  , 

ri2  —  1 


»2 


i=\ 


Test  Decisions 


Case 

Ho 

Hi 

Reject  Ho,  if 

(a) 

ax  =  cry 

Gx  7^  Gy 

y)  >  fni  —  l',ri2  —  l',l—ct/2  Or 

t(x,  y)  <  fn\  —  l;n2—l',a/2 

(b) 

ax  <  cry 

ax  >  Gy 

t{%,  y)  >  fni  —  l;n2  —  l’,l—cx 

Remark  C.3  We  have  tacitly  assumed  that  the  expected  values  fix  and  fiy  are 
unknown  and  have  to  be  estimated.  However,  this  happens  rarely,  if  ever,  in  practice. 
When  estimating  the  expected  values  by  the  arithmetic  means,  it  would  be  appropriate 
to  increase  the  degrees  of  freedom  from  n\  —  1  to  n\  and  ri2  —  1  to  ri2.  Interestingly, 
standard  software  will  not  handle  this  case  correctly. 


Example  C.3  A  company  is  putting  baked  beans  into  cans .  Two  independent  machines 
at  two  sites  are  used.  The  filling  weights  are  assumed  to  be  normally  distributed  with 
mean  1000  g.  It  is  speculated  that  one  machine  is  more  precise  than  the  other.  Two 
samples  of  the  two  machines  give  the  following  results: 


Sample 

2 

n  x  s 

X 

Y 

20  1000.49  72.38 

25  1000.26  45.42 

With  a  =  0.1  and  the  hypotheses 

2  2  2  2 

Ho  :  ax  =  aY  versus  H\  :  g~x  /  aY, 

we  get  the  following  quantiles 
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qf  (0.05,  20-1,  25-1)  R 

[1]  0.4730049 

qf  (0 . 95 ,  20-1,  25-1) 

[1]  2.039858 


The  observed  test  statistic  is 

_  72.38 
1  ~  45.42 


1.59. 


Therefore,  Ho  is  not  rejected,  since  k\  <  t  <  k^.  We  cannot  reject  the  hypothesis 
of  equal  variability  of  the  two  machines.  In  R ,  the  F- test  can  be  used  using  the 
command  var  .test. 


Remark  C.4  For  the  t-test,  we  remarked  that  the  assumption  of  normality  is  not  cru¬ 
cial  because  the  test  statistic  is  approximately  normally  distributed,  even  for  moderate 
sample  sizes.  However,  the  F-test  relies  heavily  on  the  assumption  of  normality.  This 
is  why  alternative  tests  are  often  used,  for  example  the  Levene’s  test. 


More  details  on  Chap.  1 1 


Obtaining  the  Least  Squares  Estimates  in  the  Linear  Model.  The  function  S(a ,  b) 
describes  our  optimization  problem  of  minimizing  the  residual  sum  of  squares: 

n  n 

S(a,  b)  =  ^ef  =  7>,  -  a-  bxt )2. 
i= 1  i= 1 


Minimizing  S(a,  b)  is  achieved  using  the  principle  of  maxima  and  minima  which 
involves  taking  the  partial  derivatives  of  S(a,  b)  with  respect  to  both  a  and  b  and 
setting  them  equal  to  0.  The  partial  derivatives  are 


d 

da 


S(a,  b) 


d 

db 


S(a,  b) 


n 

-2  Ypi-a-bxi),  (C.ll) 

i=l 
n 

-2  y.Cyi  -  a  -  bxi)xj. 
i= 1 


Now  we  set  (C.ll)  and  (C.12)  as  equal  to  zero,  respectively: 


(i)  Z?=i(y.-  -a-bxi)  =  0, 

(II)  E?=iCVi  -a-bxOxt  =  0. 


(C.12) 
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This  equates  to 


(I')  na+b  Y!!=i  xi 

(no  a  Z?=.  *-■  +  *  Z?=i 


Zn 

i=\yi . 

Zn 

i=ixiyt 


Multiplying  (!')  by  j-  yields 


a  +  bx  =  y 


which  gives  us  the  solution  for  a : 

a  =  y  —  bx, 

Putting  this  solution  into  (IT)  gives  us 


n 


n 


n 


(y  -  bx)  ^Xj  +b^xf  =  ^ -v, y, 


i  =  1 


/  — 1  /  — 1 


Using  Z/=i  =  nx  leads  to 


If  we  use 


and 


9  —  9 

b  I  y  xf  —  nx  i  =  7  Xiyi  —  nxy, 


n 


n 


Yxj  -  nx2  =  Yj(xi  -  x)2  =  sxx 


/  — 1 


i  =  \ 


n 


n 


Y,Xiyi  ~  nxy =  Y(Xi  ~  x^yi  ~ 3') =  Sxy’ 


i=\ 


i  =  \ 


we  eventually  obtain  the  least  squares  estimate  of  b\ 

/V 

b  SXx  = 

£  _  Zr  -  Z?=i (**  -  *)(?*  -  t) 


ZLife-*)2 


Remark  C.5  To  show  that  the  above  solutions  really  relate  to  a  minimum,  and  not 
to  a  maximum,  we  would  need  to  look  at  all  the  second-order  partial  derivatives  of 
S(a,  b)  and  prove  that  the  bordered  Hessian  matrix  containing  these  derivatives  is 
always  positive  definite.  We  omit  this  proof  however. 
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Variance  Decomposition. 

We  start  with  the  following  equation: 

yt  -  9i  =  ( yt  -y)  -  iyi  -  y)- 

If  we  square  both  sides  we  obtain 

n  n  n  n 

X/ yt  -  y^2  =  _  y )2  +  _  ^)2  _  -  y^  ~ 

i  =  1  i  =  1  i  =  1  i  =  1 

The  last  term  on  the  right-hand  side  is 


72 


&•  ~  y)(y;  -  y)  a=8)  -  y)M*/  -  *) 


72 


7  —  1 


7  —  1 


-  (11.6)  -2  c  (11.8)^.^  -x2 

=  o  sxy  =  b  sxx  =  2^(yt  -  y)  • 


72 


7  —  1 


We  therefore  obtain 


72 


72 


72 


~  i5;)2  =  Z(;V'  “  “  Z(^'  “  -y)2’ 


7  —  1 


7  —  1 


7  —  1 


which  equates  to 


±<»  -  y )2  =  1> 

7  —  1  7  —  1 

The  Relation  between  R 2  and  r. 


y)2  +  Z(>v  -  ^-)2- 

7  —  1 


72 


6 Residual  =  Z(yi  “  («  +  ^i))2  (1=8>  Zifa  “  “  b(Xi  ~  x)f 


72 


7  —  1 

=  Syy  +  PSXX  -  2M 


7  —  1 


c  £2C  _  c  (^y)‘ 

Oyy  O  0XX  Jyy 


SQk 


egression  —  ^yy 


—  Syy  SQ  Residual  — 


(SXy) 


XX 

2 


1 XX 
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We  therefore  obtain 


R 


2 


SQ  r 


egression 


$xx  Syy 


The  Least  Squares  Estimators  are  Unbiased. 


E(/3)  =  EtfX'Xr'X'y) 


Given  that  X  in  the  model  is  assumed  to  be  fixed  (i.e.  non-stochastic  and  not  following 
any  distribution),  we  obtain 


E(/3)  =  (X,X)_1X,E(y). 


Since  E(e)  =  0  it  follows  that  Ey  =  X/3  and  therefore 


E((3)  =  (X'X^X'X^  =  (3. 

How  to  Obtain  the  Variance  of  the  Least  Squares  Estimator.  With  the  same 
arguments  as  above  (i.e  X  is  fixed  and  non- stochastic)  and  applying  the  rule 
Var (bX)  =  b2V ar(X)  from  the  scalar  case  to  matrices  we  obtain: 

Var(/3)  =  VarfXX'Xr'X'y)  =  (X,X)“1X,Var(y)X(X,X)“1  =  cr2(X'X)_1. 

Maximum  Likelihood  Estimation  in  the  Linear  Model.  The  linear  model  follows 
a  normal  distribution: 


y  =  X/3  +  e  ~  TV  (X/3,  a2I). 


Therefore,  the  likelihood  function  of  y  also  follows  a  normal  distribution: 


L(/3,  a2)  =  (27 rcr2)  n^2  exp 


-^(y-X/3y(y-X/3) 


The  log-likelihood  function  is  given  by 


o  YL  o  1 

-Z\  _  - Z' 


W  ^ )  =  --  ln(2^z)  -  ^  (y  -  X/3)  (y  -  X/3), 
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To  obtain  the  maximum  likelihood  estimates  of  (3  and  a2,  one  needs  to  obtain  the 
maxima  of  the  above  function  using  the  principle  of  maxima  and  minima  that  involves 
setting  the  partial  derivatives  equal  to  zero  and  finding  the  solution: 


dl 
d/3 
dl 
da 2 


1 


2a2 


2X'(y  -  X/3)  =  0, 


n 


+ 


1 


2cr2  2(a2) 


2\2 


(y-X/3),(y-X/3)  =  0. 


We  therefore  have 


X'X/3  =  X'y,  or/3  =  (X'X)“'X'y 
a2  =  -(y  -  X/3)'(y  -  X/3) 


n 


which  give  us  the  ML  estimates  of  (3  and  a2  in  the  linear  regression  model.  Here, 
we  also  need  to  check  the  Hessian  matrix  of  second-order  partial  derivatives  to  show 
that  we  really  found  a  minimum  and  not  a  maximum.  We  omit  this  proof  however. 


Distribution  Tables 


See  Tables  C.l,  C.2  and  C.3. 


Table  C.1  CDF  values  for  the  standard  normal  distribution,  0(z).  These  values  can  also  be  obtained  in  R  by  using  the  pnorm(p)  command 
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Table  C.2  (1  —  a)  quantiles  for  the  t -distribution.  These  values  can  also  be  obtained  in  R  using 
the  qt  (p ,  df )  command. 


df 

1  —  a 

0.95 

0.975 

0.99 

0.995 

1 

6.3138 

12.706 

31.821 

63.657 

2 

2.9200 

4.3027 

6.9646 

9.9248 

3 

2.3534 

3.1824 

4.5407 

5.8409 

4 

2.1318 

2.7764 

3.7469 

4.6041 

5 

2.0150 

2.5706 

3.3649 

4.0321 

6 

1.9432 

2.4469 

3.1427 

3.7074 

7 

1.8946 

2.3646 

2.9980 

3.4995 

8 

1.8595 

2.3060 

2.8965 

3.3554 

9 

1.8331 

2.2622 

2.8214 

3.2498 

10 

1.8125 

2.2281 

2.7638 

3.1693 

11 

1.7959 

2.2010 

2.7181 

3.1058 

12 

1.7823 

2.1788 

2.6810 

3.0545 

13 

1.7709 

2.1604 

2.6503 

3.0123 

14 

1.7613 

2.1448 

2.6245 

2.9768 

15 

1.7531 

2.1314 

2.6025 

2.9467 

16 

1.7459 

2.1199 

2.5835 

2.9208 

17 

1.7396 

2.1098 

2.5669 

2.8982 

18 

1.7341 

2.1009 

2.5524 

2.8784 

19 

1.7291 

2.0930 

2.5395 

2.8609 

20 

1.7247 

2.0860 

2.5280 

2.8453 

30 

1.6973 

2.0423 

2.4573 

2.7500 

40 

1.6839 

2.0211 

2.4233 

2.7045 

50 

1.6759 

2.0086 

2.4033 

2.6778 

60 

1.6706 

2.0003 

2.3901 

2.6603 

70 

1.6669 

1.9944 

2.3808 

2.6479 

80 

1.6641 

1.9901 

2.3739 

2.6387 

90 

1.6620 

1.9867 

2.3685 

2.6316 

100 

1.6602 

1.9840 

2.3642 

2.6259 

200 

1.6525 

1.9719 

2.3451 

2.6006 

300 

1.6499 

1.9679 

2.3388 

2.5923 

400 

1.6487 

1.9659 

2.3357 

2.5882 

500 

1.6479 

1.9647 

2.3338 

2.5857 
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Table  C.3  (1  —  a)  quantiles  of  the  x2 -distribution.  These  values  can  also  be  obtained  in  R  using 
the  qchisq(p,df)  command 


df 

1  —  a 

0.01 

0.025 

0.05 

0.95 

0.975 

0.99 

1 

0.0001 

0.001 

0.004 

3.84 

5.02 

6.62 

2 

0.020 

0.051 

0.103 

5.99 

7.38 

9.21 

3 

0.115 

0.216 

0.352 

7.81 

9.35 

11.3 

4 

0.297 

0.484 

0.711 

9.49 

11.1 

13.3 

5 

0.554 

0.831 

1.15 

11.1 

12.8 

15.1 

6 

0.872 

1.24 

1.64 

12.6 

14.4 

16.8 

7 

1.24 

1.69 

2.17 

14.1 

16.0 

18.5 

8 

1.65 

2.18 

2.73 

15.5 

17.5 

20.1 

9 

2.09 

2.70 

3.33 

16.9 

19.0 

21.7 

10 

2.56 

3.25 

3.94 

18.3 

20.5 

23.2 

11 

3.05 

3.82 

4.57 

19.7 

21.9 

24.7 

12 

3.57 

4.40 

5.23 

21.0 

23.3 

26.2 

13 

4.11 

5.01 

5.89 

22.4 

24.7 

27.7 

14 

4.66 

5.63 

6.57 

23.7 

26.1 

29.1 

15 

5.23 

6.26 

7.26 

25.0 

27.5 

30.6 

16 

5.81 

6.91 

7.96 

26.3 

28.8 

32.0 

17 

6.41 

7.56 

8.67 

27.6 

30.2 

33.4 

18 

7.01 

8.23 

9.39 

28.9 

31.5 

34.8 

19 

7.63 

8.91 

10.1 

30.1 

32.9 

36.2 

20 

8.26 

9.59 

10.9 

31.4 

34.2 

37.6 

25 

11.5 

13.1 

14.6 

37.7 

40.6 

44.3 

30 

15.0 

16.8 

18.5 

43.8 

47.0 

50.9 

40 

22.2 

24.4 

26.5 

55.8 

59.3 

63.7 

50 

29.7 

32.4 

34.8 

67.5 

71.4 

76.2 

60 

37.5 

40.5 

43.2 

79.1 

83.3 

88.4 

70 

45.4 

48.8 

51.7 

90.5 

95.0 

100.4 

80 

53.5 

57.2 

60.4 

101.9 

106.6 

112.3 

90 

61.8 

65.6 

69.1 

113.1 

118.1 

124.1 

100 

70.1 

74.2 

77.9 

124.3 

129.6 

135.8 

Quantiles  of  the  F -Distribution.  These  quantiles  can  be  obtained  in  R  using  the 
qf  (p ,  df  1 ,  df  2 )  command. 
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Summary  of  Tests  for  Continuous  and  Ordinal  Variables 
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Summary  of  Tests  for  Nominal  Variables 
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CDF,  133 
expectation,  144 
normal  random  variables,  168 
probabilities,  117 
variance,  144 
Causation,  288 
CDF,  129 

calculation  rules,  133 


joint,  141 
quantile,  137 
quartile,  137 

Central  limit  theorem,  426 
Central  tendency,  38 
Certain  event,  110 

x2 

distribution,  171,  427 
goodness-of-fit  test,  235 
independence  test,  238 
test  of  homogeneity,  240,  241 
variance  test,  430 
Coding 
dummy,  266 
effect,  267 
Coefficient 
binomial,  102 
of  variation,  55 
regression,  251 
Combinations,  102 
with  order,  103,  104 
with  replacement,  103,  104 
without  order,  102,  103 
without  replacement,  102,  103 
Combinatorics,  97 
Complementary  event,  110 
Composite  event,  110 
Conditional 
distribution,  141,  143 
frequency  distribution,  70 
probability,  117 

relative  frequency  distribution,  70 
Confidence 
bound,  196 
interval,  196,  197 
interval  for  /i\  a1  known,  197 
interval  for  \i\  a1  unknown,  198 
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interval  for  p,  199 
interval  for  the  odds  ratio,  201 
level,  196 
Consistency,  189 
Contingency 
coefficient,  77 
table,  68,  140 
Continuous  variable,  6 
Convergence 
stochastic,  425 
Correlation  coefficient,  148 
of  Bravais-Pearson,  82 
of  Spearman,  84 
product  moment,  82 
Covariance,  146,  147 
Covariate,  25 1 
Cramer’s  V,  77 

Cross  tabulation,  see  contingency  table 
Cumulative 

distribution  function,  129 
frequency,  see  frequency,  cumulative 
marginal  distribution,  143 

D 

Data 

matrix,  3,  9 
observation,  3 
set,  3,  9 

transformation,  11 
unit,  3 

Decomposition 
complete,  113 

Degenerate  distribution,  156 
Density,  129 
Design  matrix,  263,  270 
Dispersion,  48 
absolute  deviation,  5 1 
absolute  mean  deviation,  5 1 
absolute  median  deviation,  5 1 
mean  squared  error,  5 1 
measure,  49 
range,  49 

standard  deviation,  5 1 
Distribution,  19 
Bernoulli,  156 
Binomial,  157 
x2,  171,427 
conditional,  141,  143 
conditional  frequency,  70 
conditional  relative  frequency,  70 
continuous,  165 


cumulative  marginal,  143 

degenerate,  156 

exponential,  170 

F,  427 

Gauss,  166 

geometric,  163 

hypergeometric,  163 

independent  and  identical,  145,  426 

joint  relative  frequency,  70 

marginal,  140,  142 

marginal  frequency,  70 

marginal  relative  frequency,  70 

multinomial,  161 

normal,  166 

Poisson,  160 

standard,  154 

Student,  172 

t,  172,  427 

uniform  discrete,  154 
Duality,  216 
Dummy  variable,  266 

E 

Efficiency,  185 
Elementary  event,  110 
Empirical  cumulative  distribution  function 
(ECDF),  19 

Epitools,  see  R  packages 
Error 

type  I,  213 
type  II,  213 
Estimation 
interval,  195 

least  squares,  252,  253,  264 
maximum  likelihood,  192 
method  of  moments,  195 
nonparametric,  182 
parametric,  182 
Event,  110 

additive  theorem,  116 
certain,  110 
composite,  110 
disjoint,  112 
elementary,  110 
impossible,  110 
simple,  110 
sure,  110 

theorem  of  additivity,  115,  116 
Expectation,  134 
calculation  rules,  144 
Expected  frequencies,  74 
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Experiment 
Laplace,  114 
random,  109 

Exponential  distribution,  170 

F 

Factorial  function,  see  function,  factorial 

F-distribution,  173,  427 

Fisher 

exact  test,  230 
Foreign,  see  R  packages 
Frequency 
absolute,  18,  69,  113 
cumulative,  19 
expected,  74 
relative,  18,  21,  113 
table,  19,  69 
F-Test,  272,  431 
Function 

cumulative  distribution,  see  CDF 
empirical  cumulative  distribution,  see 
ECDF 

factorial,  100 

joint  cumulative  distribution,  141 
joint  probability  distribution,  140,  141 
probability  mass,  see  PMF 
step,  133 

G 

Gamma 

of  Goodman  and  Kruskal,  87 
Gauss  test 
one- sample,  216 
two- sample,  221 

Generalized  method  of  moments,  195 
Geometric  distribution,  163 
Ggplot2,  see  R  packages 
Gini  coefficient,  60 
standardized,  61 
Goodman  and  Kruskal’ s  7,  87 
Goodness  of  fit 
adjusted  measure,  281 
measure,  258 
test,  235 
Graph 

bar  chart,  24 
box  plot,  56 
histogram,  27 
kernel  density  plot,  29 
Lorenz  curve,  58 
pie  chart,  26 


QQ-plot,  44 
random  plot,  286 
scatter  plot,  80 
Growth 
factor,  46 
rate,  46 

H 

Heteroscedasticity,  286 
Histogram,  27 
Homoscedasticity,  286 
Hypergeometric  distribution,  163 
Hypothesis,  210 
alternative,  211 
linear,  280 
null,  211 
one-sided,  211 
two-sided,  211 

I 

i.i.d.,  145,  153,  426 
Impossible  event,  110 
Independence,  73,  121 
pairwise,  122 
random  variables,  143 
stochastic,  121,  144 
Ineq,  see  R  packages 
Inequality 
Tschebyschev,  139 
Inference,  181,  182 
least  squares,  252,  264 
maximum  likelihood,  192 
method  of  moments,  195 
Interaction,  276 
Intercept,  25 1 
Interquartile  range,  49 
Interval  estimation,  195 

J 

Joint 

cumulative  distribution  function,  141 
frequency  distribution,  70 
probability  distribution  function,  140, 
141 

relative  frequency  distribution,  70 

K 

Kernel  density  plot,  29 
Kolmogorov-Smimov  test,  237 
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L 

Laplace 

experiment,  114 
probability,  114 
Lattice,  see  R  packages 
Least  squares,  252 
Life  time,  170 
Likelihood,  192 
Line  of  equality,  59 
Linear 

hypotheses,  280 
Linear  model,  25 1 
residuals,  270 
Linear  regression,  249 
interaction,  276 
Location  parameter,  38 
Log-linear  model,  269 
Lorenz  curve,  58 

M 

Mann- Whitney  U-test,  232 
Marginal 

distribution,  140,  142 
frequency  distribution,  70 
relative  frequency  distribution,  70 
MASS  ,  see  R  packages 
Matrix 

covariance,  147 
design,  263 

Maximum  likelihood  estimation  (MLE), 
192 

Mean 

arithmetic,  38 
properties,  40 
weighted  arithmetic,  38 
Mean  squared  error  (MSE),  51,  184 
Measure 
dispersion,  49 
symmetric,  76 
Measure  of  association 
X2  coefficient,  76 
contingency  coefficient  C,  77 
correlation  coefficient,  82 
Cramer’s  V,  11 
odds  ratio,  78 

rank  correlation  coefficient,  84 
relative  risk,  78 
Memorylessness,  170 
Method  of  moments,  195 
Model 

fitted  regression  model,  253 


fitted  value,  253 
linear,  25 1 
log-linear,  269 
nonlinear,  251 

Multinomial  distribution,  161 
Multiple  linear  regression,  262 
Multiplication  theorem  of  probability,  119 
Multivariate,  249 
Mvtnorm,  see  R  packages 

N 

Namibia,  206 
Newton-Raphson,  193 
Nominal  variable,  6 
Normal  distribution,  166 

O 

Observation,  3 
Odds  ratio,  78 

One- sample  problem,  209,  210 
Ordered 
set,  99 
values,  20 
Ordinal  variable,  6 
Outcome,  25 1 

P 

Parameter 
location,  38 
regression,  251 
space,  184 
PDF,  129 
joint,  140,  141 
Percentile,  43 
Permutation,  101 
with  replacement,  101 
without  replacement,  101 
Pie  chart,  26 
Plot 

kernel  density,  29 
QQ,  44 
random,  286 
scatter,  79 
trumpet,  286,  288 
Poisson  distribution,  160 
Polynomial  regression,  267 
Population,  4 
Power,  213 
Probability 
calculation  rules,  117 
conditional,  117,  119 
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density  function,  129 
Laplace,  114 
mass  function,  132 
posterior,  120 
prior,  120 
Probability  theory 
axioms,  115 
p- value,  215 

Q 

QQ-plot,  44 
Quantile,  42,  137 
Quartile,  43,  137 
Quintile,  43 

R 

R2 

adjusted,  281 
Random  variables,  127 
bivariate,  140 
continuous,  129 
discrete,  131 
i.i.d,  153 

independence,  144 
standardization,  138 
Range,  49 

Real  stories,  63,  124,  176,  244,  245 
Realization,  128 
Reference  category,  266 
Regression 
line,  25 1 
linear,  249 
multiple  linear,  262 
polynomial,  267 
Regressor,  251 
Relationship,  249 
Relative  risk,  78 
Residuals,  253,  270 
standardized,  285 
Response,  25 1 
R  packages 
compositions,  219 
epitools,  202 
foreign,  12 
ggplot2,  35,  73,  199 
ineq,  59,  62 
lattice,  73 
MASS,  28,  283 
mvtnorm,  316 
ryouready,  87,  88 
TeachingDemos,  445 


vcd,  77 

S 

Sample 
estimate,  182 
pooled  variance,  222 
space,  110 
variance,  5 1 
Sampling 

without  replacement,  163 
Scale 

absolute,  7 
continuous,  6 
interval,  7 
nominal,  6 
ordinal,  6 
ratio,  7 

Scatter  plot,  80 
Set 

ordered,  100 
unordered,  100 
Significance  level,  213 
Simple  event,  110 
Slope,  251 

Standard  deviation,  51,  136 
Standard  error,  189,  205 
Standardization,  54,  138 
Standard  normal  distribution,  166 
Statistic,  146,  183 
Step  function,  20,  133 
Stochastic  convergence,  425 
Stuart’s  rc,  87 
Sufficiency,  190 
Sure  event,  110 

T 

Table 

contingency,  68,  69,  140 
frequency,  19,  69 

rc 

of  Stuart,  87 
T-distribution,  172,  427 
Test 

ANOVA,  274 
Binomial,  227 
X2  for  variance,  232,  430 
X2  goodness  of  fit,  235 
X2  independence,  238 
X2  of  homogeneity,  240 
duality,  216 
equivalence,  213 
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F,  272,  43 1 
Fisher,  230,  428 
Friedman,  446 
Kolmogorov-Smirnov,  237 
Kruskal- Wallis,  446 
Mann- Whitney  U,  232 
McNemar,  447 
Mood,  446 

one- sample  Gauss  test,  216 

one- sample  t-test,  219 

one-sided,  211 

overall  F,  272 

paired  t-test,  225 

sign,  446 

significance,  213 

two- sample  binomial,  230 

two-sample  Gauss  test,  221 

two- sample  t-test,  222 

two-sided,  211 

U,  232 

Welch,  222 

Wilcoxon  rank  sum,  232 
Wilcoxon-Mann- Whitney,  232 
Test  distributions,  317 
Theorem 

additivity,  116,  147 
additivity  of  x2  variables,  172 
additivity  of  disjoint  events,  115 
Bayes,  120 
central  limit,  426 
i.i.d.,  185 

large  numbers,  425,  426 
law  of  total  probability,  119 
multiplication  for  probabilities,  119 
Neyman-Fisher  Factorization,  191 
PDF,  129 

standardization,  138 
Student,  172 
Tschebyschev,  139 
variance  decomposition,  136 
Tie,  85,  86 
Transformation,  11 
T-test 


one- sample,  219 
paired,  225 
two-sample,  222 

Two-sample  problem,  210 

U 

Uniform  distribution 
continuous,  165 
discrete,  154 

Unit,  3 

Unordered  set,  100 

U  test,  232 

V 

Variable,  4 
binary,  7 

bivariate  random,  140 
categorical,  7 
continuous,  6 
dependent,  25 1 
discrete,  6 
dummy,  266 
grouped,  7 
independent,  25 1 
nominal,  6 
ordinal,  6 
random,  127 
response,  25 1 
standardized,  54 

Variance,  51,  135 
additivity  theorem,  147 
between  classes,  53 
calculation  rules,  144 
decomposition,  257 
dispersion,  48 
pooled,  222 
within  classes,  53 

Vcd,  see  R  packages 

W 

Welch  test,  222 

Whiskers,  56 

Wilcoxon-Mann- Whitney  test,  232 


